Abstract

Skip graphs are a novel distributed data structure, based on skip lists, that provide the full function- 
ality of a balanced tree in a distributed system where resources are stored in separate nodes that may fail 
at any time. They are designed for use in searching peer-to-peer systems, and by providing the ability to 
perform queries based on key ordering, they improve on existing search tools that provide only hash table 
functionality. Unlike skip lists or other tree data structures, skip graphs are highly resilient, tolerating a 
large fraction of failed nodes without losing connectivity. In addition, constructing, inserting new nodes 
into, searching a skip graph, and detecting and repairing errors in the data structure introduced by node 
failures can be done using simple and straightforward algorithms. 

1 Introduction 

Peer-to-peer networks are distributed systems without any central authority that are used for efficient loca- 
tion of shared resources. Such systems have become very popular for Internet applications in a short period 
of time. A survey of recent peer-to-peer research yields a slew of desirable features for peer-to-peer systems
such as decentralization, scalability, fault-tolerance, self-stabilization, data availability, load balancing, dy- 
namic addition and deletion of peer nodes, efficient and complex query searching, incorporating geography in 
searches and exploiting spatial as well as temporal locality in searches. The initial approaches, such as those 
used by Napster [Nap], Gnutella [Gnu] and Freenet [CSWH00], do not support most of these features and are
clearly unscalable either due to the use of a central server (Napster) or due to high message complexity from 
performing searches by flooding the network (Gnutella). The performance of Freenet is difficult to evaluate, 
but it provides no provable guarantee on the search latency and permits accessible data to be missed. 

Recent peer-to-peer systems like CAN [RFH+01], Chord [SMLN+03], Pastry [RD01], Tapestry [ZKJ01] 
and Viceroy [MNR02] use a distributed hash table (DHT) approach to overcome scalability problems. To 
ensure scalability, they hash the key of a resource to determine which node it will be stored at and balance 
out the load on the nodes in the network. The main operation in these networks is to retrieve the identity 
of the node which stores the resource, from any other node in the network. To this end, there is an overlay 
graph in which the location of the nodes and resources is determined by the hashed values of their identities 
and keys respectively. Resource location using the overlay graph is done in these various networks by using 
different routing algorithms. Pastry and Tapestry use the algorithm of Plaxton et al. [PRR99], which is based
on hypercube routing: the message is forwarded deterministically to a neighbor whose identifier is one digit
closer to the target identifier. CAN partitions a d-dimensional coordinate space into zones that are owned
by nodes which store keys mapped to their zone. Routing is done by greedily forwarding messages to the
neighbor closest to the target zone. Chord maps nodes and resources to identities of b bits placed around
a modulo $2^b$ identifier circle and each node maintains links to distances $2^0, 2^1, \ldots$ for greedy routing. With
m machines in the system, most of these networks use $O(\log m)$ space and time for routing and $O(\log m)$
time for node insertion (with the exception of Chord that takes $O(\log^2 m)$ time). Because hashing destroys
the ordering on keys, DHT systems do not support queries that seek near matches to a key or keys within
a given range.

Some of these systems try to optimize performance by taking topology into account. Pastry [RD01, 
CDHR02] and Tapestry [ZKJ01, ZJK02] exploit geographical proximity by choosing the physically closest 
node out of all the possible nodes with an appropriate identifier prefix. In CAN [RFH+01], each node 
measures its round-trip delay to a set of landmark nodes, and accordingly places itself in the co-ordinate 
space to facilitate routing with respect to geographic proximity. This last method is not fully self-organizing 
and may cause imbalance in the distribution of nodes leading to hot spots. Some methods to solve the 
nearest neighbor problem for overlay networks can be seen in [HKRZ02] and [KR02] . 

Some of these systems are partly resilient to random node failures, but their performance may be badly 
impaired by adversarial deletion of nodes. Fiat and Saia [FS02] present a network which is resilient to 
adversarial deletion of a constant fraction of the nodes; some extensions of this result can be seen in [SFG+02, 
Dat02]. However, they do not give efficient methods to dynamically maintain such a network. 

TerraDir [SBK02] is a recent system that provides locality and maintains a hierarchical data structure 
using caching and replication. There are as yet no provable guarantees on load balancing and fault tolerance 
for this system. 

1.1 Our approach 

The underlying structure of Chord, CAN, and similar DHTs resembles a balanced tree in which balancing 
depends on the near-uniform distribution of the output of the hash function. So the costs of constructing, 
maintaining, and searching these data structures are closer to the $O(\log n)$ costs of tree operations than the
$O(1)$ costs of traditional hash tables. But because keys are hashed, DHTs can provide only hash table
functionality. Our approach is to exploit the underlying tree structure to give tree functionality, while 
applying a simple distributed balancing scheme to preserve balance and distribute load. 

We describe a new model for a peer-to-peer network based on a distributed data structure that we call
a skip graph. This distributed data structure has several benefits: Resource location and dynamic node 
addition and deletion can be done in logarithmic time, and each node in a skip graph requires only logarithmic 
space to store information about its neighbors. More importantly, there is no hashing of the resource keys, 
so related resources are present near each other in a skip graph. This may be useful for certain applications 
such as prefetching of web pages, enhanced browsing, and efficient searching. Skip graphs also support 
complex queries such as range queries, i.e., locating resources whose keys lie within a certain specified 
range¹. There has been some interest in supporting complex queries in peer-to-peer systems, and designing
a system that supports range queries was posed as an open question [HHH+02]. Skip graphs are resilient
to node failures: a skip graph tolerates removal of a large fraction of its nodes chosen at random without
becoming disconnected, and even the loss of an $O(1/\log n)$ fraction of the nodes chosen by an adversary still
leaves most of the nodes in the largest surviving component. Skip graphs can also be constructed without
knowledge of the total number of nodes in advance. In contrast, DHT systems such as Pastry and Chord 
require a priori knowledge about the size of the system or its keyspace. 

The rest of the paper is organized as follows: we describe skip graphs and algorithms for them in detail 
in Sections 2 and 3. We describe the fault-tolerance properties and the repair mechanism for a skip graph 
in Sections 4 and 5. We discuss contention analysis and some recent related work in Sections 6 and 7 
respectively. Finally, we conclude in Section 8. 

1.2 Model 

We briefly describe the model for our algorithms. We assume a message passing environment in which all
processes communicate with each other by sending messages over a communication channel. The system is
partially synchronous, i.e., there is a fixed upper bound (time-out) on the transmission delay of a message.
Processes can crash, i.e., halt prematurely, and crashes are permanent. We assume that each message takes
at most unit time to be delivered and any internal processing at a machine takes no time.

¹ Skip graphs support complex queries along a single dimension, i.e., for one attribute of the resource, for example, its name
key.



2 Skip graphs 



A skip list, introduced by Pugh [Pug90], is a randomized balanced tree data structure organized as a tower
of increasingly sparse linked lists. Level 0 of a skip list is a linked list of all nodes in increasing order by key.
For each i greater than 0, each node in level i − 1 appears in level i independently with some fixed probability
p. In a doubly-linked skip list, each node stores a predecessor pointer and a successor pointer for each list in
which it appears, for an average of $2/(1-p)$ pointers per node. The lists at the higher level act as "express lanes"
that allow the sequence of nodes to be traversed quickly. Searching for a node with a particular key involves
searching first in the highest level, and repeatedly dropping down a level whenever it becomes clear that
the node is not in the current level. Considering the search path in reverse shows that no more than $1/(1-p)$
nodes are searched on average per level, giving an average search time of
$O\left(\frac{\log n}{(1-p)\log(1/p)}\right)$ with n nodes
at level 0. Skip lists have been extensively studied [Pug90, PMP90, Dev92, KP94, KMP95], and because
they require no global balancing operations are particularly useful in parallel systems [GMM96, GM97].

Figure 1: A skip list with n = 6 nodes and $\lceil \log n \rceil = 3$ levels.
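
The skip list construction and search described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each level is stored as a sorted Python list rather than as linked nodes, and the names `build_skip_list` and `search` are hypothetical helpers, not taken from [Pug90].

```python
import random

def build_skip_list(keys, p=0.5, seed=0):
    """Return levels, where levels[i] is the sorted list of keys present at level i."""
    rng = random.Random(seed)
    levels = [sorted(keys)]          # level 0 holds every key
    while len(levels[-1]) > 1:
        # Each key of level i-1 appears in level i independently with probability p.
        promoted = [k for k in levels[-1] if rng.random() < p]
        if not promoted:
            break
        levels.append(promoted)
    return levels

def search(levels, target):
    """Return (largest key <= target or None, number of keys stepped over)."""
    best, visited = None, 0
    for level in reversed(levels):   # start in the sparsest list
        # Resume from the key reached one level up; it also appears in this list.
        # (list.index is linear; a real skip list follows a down pointer instead.)
        i = level.index(best) + 1 if best is not None else 0
        # Move right until the next key would overshoot, then drop down a level.
        while i < len(level) and level[i] <= target:
            best, visited, i = level[i], visited + 1, i + 1
    return best, visited
```

For example, `search(build_skip_list([13, 21, 33, 48, 75, 99]), 40)` returns 33 as the largest key not exceeding 40, together with the number of keys the search stepped over.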

We would like to use a data structure similar to a skip list to support typical binary tree operations on a 
sequence whose nodes are stored at separate locations in a highly distributed system subject to unpredictable 
failures. A skip list alone is not enough for our purposes, because it lacks redundancy and is thus vulnerable 
to both failures and congestion. Since only a few nodes appear in the highest-level list, each such node acts as 
a single point of failure whose removal partitions the list, and forms a hot spot that must process a constant 
fraction of all search operations. Skip lists also offer few guarantees that individual nodes are not separated 
from the rest even with occasional random failures. Since each node is connected on average to only $O(1)$
other nodes, even a constant probability of node failures will isolate a large fraction of the surviving nodes. 

Our solution is to define a generalization of a skip list that we call a skip graph. As in a skip list, each 
of the n nodes in a skip graph is a member of multiple linked lists. The level 0 list consists of all nodes in
sequence. Where a skip graph is distinguished from a skip list is that there may be many lists at level i,
and every node participates in one of these lists, until the nodes are splintered into singletons after $O(\log n)$
levels on average. A skip graph supports search, insert, and delete operations analogous to the corresponding 
operations for skip lists; indeed, we show in Lemma 1 that algorithms for skip lists can be applied directly 
to skip graphs, as a skip graph is equivalent to a collection of n skip lists that happen to share some of their 
lower levels. 

Because there are many lists at each level, the chance that any individual node participates in some
search is small, eliminating both single points of failure and hot spots. Furthermore, each node has $O(\log n)$
neighbors on average, and with high probability no node is isolated. In Section 4 we observe that skip graphs
are resilient to node failures and have an expansion ratio of $\Omega(1/\log n)$ with n nodes in the graph.

In addition to providing fault tolerance, having an $\Omega(\log n)$ degree to support $O(\log n)$ search time
appears to be necessary for distributed data structures based on nodes in a one-dimensional space linked
by random connections satisfying certain uniformity conditions [ADS02]. While this lower bound requires
some independence assumptions that are not satisfied by skip graphs, there is enough similarity between skip
graphs and the class of models considered in the bound that an $\Omega(\log n)$ average degree is not surprising.

We now give a formal definition of a skip graph. Precisely which lists a node x belongs to is controlled 
by a membership vector m(x). We think of m(x) as an infinite random word over some fixed alphabet, 
although in practice, only an $O(\log n)$ length prefix of m(x) needs to be generated on average. The idea of
the membership vector is that every linked list in the skip graph is labeled by some finite word w, and a 
node x is in the list labeled by w if and only if w is a prefix of m(x). 




Figure 2: A skip graph with n = 6 nodes and $\lceil \log n \rceil = 3$ levels. (Labels in the figure: membership vector, skip list, levels 0 to 2.)

To reason about this structure formally, we will need some notation. Let $\Sigma$ be a finite alphabet, let
$\Sigma^*$ be the set of all finite words consisting of characters in $\Sigma$, and let $\Sigma^\omega$ consist of all infinite words. We
use subscripts to refer to individual characters of a word, starting with subscript 0; a word $w$ is equal to
$w_0 w_1 w_2 \ldots$. Let $|w|$ be the length of $w$, with $|w| = \infty$ if $w \in \Sigma^\omega$. If $|w| \ge i$, write $w \upharpoonright i$ for the prefix of $w$
of length $i$. Write $\epsilon$ for the empty word. If $v$ and $w$ are both words, write $v \preceq w$ if $v$ is a prefix of $w$, i.e., if
$w \upharpoonright |v| = v$. Write $w_i$ for the i-th character of the word $w$. Write $w_1 \wedge w_2$ for the common prefix (possibly
empty) of the words $w_1$ and $w_2$.

Returning to skip graphs, the bottom level is always a doubly-linked list $S_\epsilon$ consisting of all the nodes
in order as shown in Figure 2. In general, for each $w$ in $\Sigma^*$, the doubly-linked list $S_w$ contains all $x$ for
which $w$ is a prefix of $m(x)$, in increasing order. We say that a particular list $S_w$ is part of level $i$ if $|w| = i$.
This gives an infinite family of doubly-linked lists; in an actual implementation, only those $S_w$ with at least
two nodes are represented. A skip graph is precisely a family $\{S_w\}$ of doubly-linked lists generated in this
fashion. Note that because the membership vectors are random variables, each $S_w$ is also a random variable.

We can also think of a skip graph as a random graph, where there is an edge between x and y whenever
x and y are adjacent in some $S_w$. Define x's left and right neighbors at level i as its immediate predecessor
and successor, respectively, in $S_{m(x) \upharpoonright i}$, or $\bot$ if no such nodes exist. We will write $xL_i$ for x's left neighbor at
level i and $xR_i$ for x's right neighbor, and in general will think of the $R_i$ as forming a family of associative
composable operators to allow writing expressions like $xR_i R_{i-1}^2$ etc. We write x.maxLevel for the first level
$\ell$ at which x is in a singleton list, i.e., x has at least one neighbor at level $\ell - 1$.
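
To make the definitions concrete, the following Python sketch builds the lists $S_w$ from finite membership-vector prefixes and reads off a node's neighbors at a given level. The six keys and their two-character vectors are made-up example values, `None` stands in for $\bot$, and `build_lists` and `neighbors` are hypothetical helper names.

```python
from collections import defaultdict

def build_lists(membership):
    """Map each word w to the sorted list S_w of keys x with w a prefix of m(x)."""
    lists = defaultdict(list)
    for key in sorted(membership):              # increasing key order
        vec = membership[key]
        for i in range(len(vec) + 1):
            lists[vec[:i]].append(key)          # key joins S_w for every prefix w of m(x)
    return dict(lists)

def neighbors(lists, membership, x, level):
    """Return (left, right) neighbors of x at the given level; None plays the role of ⊥."""
    vec = membership[x]
    if level > len(vec):                        # prefix not generated: no list at this level
        return None, None
    lst = lists[vec[:level]]
    i = lst.index(x)
    left = lst[i - 1] if i > 0 else None
    right = lst[i + 1] if i + 1 < len(lst) else None
    return left, right

# Assumed example vectors over the alphabet {0, 1}.
m = {13: "00", 21: "10", 33: "01", 48: "00", 75: "11", 99: "11"}
lists = build_lists(m)
```

With these vectors, the level 1 list labeled "0" contains 13, 33 and 48, so node 33 has neighbors 13 and 48 at level 1 and is alone in its list at level 2.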

An alternative view of a skip graph is a trie [dlB59, Fre60, Knu73] of skip lists that share their lower
levels. If we think of a skip list formally as a sequence of random variables $S_0, S_1, S_2, \ldots$, where the value of
$S_i$ is the level $i$ list, then we have:

Lemma 1 Let $\{S_w\}$ be a skip graph with alphabet $\Sigma$. For any $z \in \Sigma^\omega$, the sequence $S_0, S_1, S_2, \ldots$, where
each $S_i = S_{z \upharpoonright i}$, is a skip list with parameter $p = |\Sigma|^{-1}$.

Proof: By induction on $i$. The list $S_0$ equals $S_\epsilon$, which is just the base list of all nodes. A node $x$
appears in $S_i$ if $m(x) \upharpoonright i = z \upharpoonright i$; conditioned on this event occurring, the probability that $x$ also appears
in $S_{i+1}$ is just the probability that $m(x)_{i+1} = z_{i+1}$. This event occurs with probability $p = |\Sigma|^{-1}$ and it is
easy to see that it is independent of the corresponding event for any other $x'$ in $S_i$. Thus each node in $S_i$
appears in $S_{i+1}$ with independent probability $p$, and $S_0, S_1, \ldots$ form a skip list. ∎

For a node x with membership vector m(x), let the skip list $S_{m(x)}$ be called the skip list restriction of
node x.
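
As a small illustration of Lemma 1, the sketch below extracts the sequence $S_{z \upharpoonright 0}, S_{z \upharpoonright 1}, \ldots$ for a finite word z from a table of membership-vector prefixes. The function name `skip_list_restriction` and the example vectors are assumptions made for this illustration; the code only shows the nesting of the levels, not the probabilistic claim.

```python
def skip_list_restriction(membership, z):
    """Return the levels S_{z|0}, S_{z|1}, ... as sorted key lists, stopping at the first empty one."""
    levels = []
    for i in range(len(z) + 1):
        # A key belongs to S_{z|i} exactly when its vector agrees with z on the first i characters.
        level = sorted(k for k, vec in membership.items()
                       if len(vec) >= i and vec[:i] == z[:i])
        if not level:
            break
        levels.append(level)
    return levels

# Assumed example vectors over the alphabet {0, 1}.
m = {13: "00", 21: "10", 33: "01", 48: "00", 75: "11", 99: "11"}
restriction_of_13 = skip_list_restriction(m, m[13])
```

Here `restriction_of_13` is `[[13, 21, 33, 48, 75, 99], [13, 33, 48], [13, 48]]`: each level is a subset of the one below it, as in an ordinary skip list.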

2.1 Implementation 

In an actual implementation of a peer-to-peer system using a skip graph, each node in a skip graph will be 
a resource. The resources are sorted in increasing lexicographic order of their keys. Mapping these keys to 
actual physical machines can be done in two ways: In the first approach, we make every machine responsible
for the resources that it hosts. Alternatively, we use a DHT approach where we hash node identifiers and 
resource keys to determine which nodes will be responsible for which keys. The first approach gives security 
and manageability whereas the second one gives good load balancing. For now, we treat nodes in the skip 
graph as representing resources, and present our results without committing to how these resources are 
distributed across machines. Each node in a skip graph stores the address and the key of its successor and 
predecessor at each of the O(logn) levels. In addition, each node also needs O(logn) bits of space for its 
membership vector. 

In both of the above approaches, with n resources in the network, each machine is responsible for
maintaining $O(\log n)$ links for each resource that it hosts, for a total of $O(n \log n)$ links in the entire network.
This is a much higher storage requirement than the $O(m \log m)$ links for DHTs, where m is the number
of machines in the system. Further, in our repair mechanism (described in Section 5), each machine will 
periodically check to see that its links are functional. This may result in a flood of messages given the high 
number of links per machine. It is an open question how to reduce the number of pointers in a skip graph 
and yet maintain the locality properties. 

3 Algorithms for a skip graph 

In this section, we describe the search, insert and delete operations for a skip graph. For simplicity, we refer 
to the key of a node (e.g. x.key) with the same notation (e.g. x) as the node itself. It will be clear from the 
context whether we refer to a node or its key. In the algorithms, we denote the pointer to x's successor and 
predecessor at level ℓ as x.neighbor[R][ℓ] and x.neighbor[L][ℓ] respectively. We define $xR_\ell$ formally to be the
value of x.neighbor[R][ℓ], if x.neighbor[R][ℓ] is a non-nil pointer to a non-faulty node, and $\bot$ otherwise. We
define $xL_\ell$ similarly. We summarize the variables stored at each node in Table 1.



Variable       Type
key            Resource key
neighbor[R]    Array of successor pointers
neighbor[L]    Array of predecessor pointers
m              Membership vector
maxLevel       Integer
deleteFlag     Boolean

Table 1: List of all the variables stored at each node.

In this section, we only give the algorithms and analyze their performance; we defer the proofs of the
correctness of the algorithms to Section 3.4.

3.1 The search operation



The search operation (Algorithm 1) is identical to the search in a skip list with only minor adaptations to 
run in a distributed system. The search is started at the topmost level of the node seeking a key and it 
proceeds along each level without overshooting the key, continuing at a lower level if required, until it reaches 
level 0. Either the address of the node storing the search key, if it exists, or the address of the node storing 
the largest key less than the search key is returned. 



Algorithm 1: search for node v

upon receiving (searchOp, startNode, searchKey, level):
    if (v.key = searchKey) then
        send (foundOp, v) to startNode
    if (v.key < searchKey) then
        while level ≥ 0 do
            if ((v.neighbor[R][level]).key ≤ searchKey) then
                send (searchOp, startNode, searchKey, level) to v.neighbor[R][level]
                break
            else level ← level − 1
    else
        while level ≥ 0 do
            if ((v.neighbor[L][level]).key ≥ searchKey) then
                send (searchOp, startNode, searchKey, level) to v.neighbor[L][level]
                break
            else level ← level − 1
    if (level < 0) then
        send (notFoundOp, v) to startNode
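
The message flow of Algorithm 1 can be replayed in a single process to see which nodes a search visits. The Python sketch below is a centralized simulation under stated assumptions: nodes are identified with their keys, `None` stands in for $\bot$, each forwarded message counts as one hop, the replay stops as soon as the key is found, and `build_neighbors` and `search` are hypothetical helper names. The membership vectors are made-up example values.

```python
def build_neighbors(membership):
    """Build nbr[key]['L' or 'R'][level] from membership-vector prefixes."""
    keys = sorted(membership)
    depth = max(len(v) for v in membership.values())
    nbr = {k: {"L": [None] * (depth + 1), "R": [None] * (depth + 1)} for k in keys}
    for level in range(depth + 1):
        last = {}                                   # list label -> most recent key in that list
        for k in keys:
            if len(membership[k]) < level:
                continue                            # k takes part in no list at this level
            w = membership[k][:level]
            if w in last:                           # link k behind the previous member of S_w
                nbr[last[w]]["R"][level] = k
                nbr[k]["L"][level] = last[w]
            last[w] = k
    return nbr

def search(nbr, start, search_key, level):
    """Replay Algorithm 1 from node `start`; return (operation, node, hops)."""
    v, hops = start, 0
    while True:
        if v == search_key:
            return "foundOp", v, hops
        side = "R" if v < search_key else "L"
        while level >= 0:
            nxt = nbr[v][side][level]
            if nxt is not None and (nxt <= search_key if side == "R" else nxt >= search_key):
                v, hops = nxt, hops + 1             # forward the search without overshooting
                break
            level -= 1                              # otherwise continue one level lower
        if level < 0:
            return "notFoundOp", v, hops

m = {13: "00", 21: "10", 33: "01", 48: "00", 75: "11", 99: "11"}   # assumed example vectors
nbr = build_neighbors(m)
```

With these vectors, `search(nbr, 13, 75, 2)` reaches key 75 in two hops (13 to 48 at level 2, then 48 to 75 at level 0), and `search(nbr, 13, 40, 2)` ends with `notFoundOp` at node 33, the largest key less than 40.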



Lemma 2 The search operation in a skip graph S with n nodes takes expected $O(\log n)$ messages and
$O(\log n)$ time.

Proof: Let $\Sigma$ be the alphabet for the membership vectors of the nodes in the skip graph S, and z be
the node at which the search starts. By Lemma 1, the sequence $S_{m(z)} = S_0, S_1, S_2, \ldots$, where each $S_i = S_{m(z) \upharpoonright i}$,
is a skip list. A search that starts at z in the skip graph will follow the same path in S as in $S_{m(z)}$. So we
can directly apply the skip list search analysis given in [Pug90] to analyze the search in S. With n nodes,
on average there will be $O\left(\frac{\log n}{\log(1/p)}\right)$ levels, for $p = |\Sigma|^{-1}$. At most $1/(1-p)$ nodes are searched on average
at each level, for a total of $O\left(\frac{\log n}{(1-p)\log(1/p)}\right)$ expected messages and $O\left(\frac{\log n}{(1-p)\log(1/p)}\right)$ expected time.
Thus, with fixed p, the search operation takes expected $O(\log n)$ messages and $O(\log n)$ time. ∎

The network performance depends on the value of $p = |\Sigma|^{-1}$. As p increases, the search time decreases,
but the number of levels increases, so each node has to maintain neighbors at more levels. Thus we get a
trade-off between the search time and the storage requirements at each node. 

The performance shown in Lemma 2 is comparable to the performance of distributed hash tables, for 
example, Chord [SMLN+03]. With n resources in the system, a skip graph takes $O(\log n)$ time for one search
operation. In comparison, Chord takes $O(\log m)$ time, where m is the number of machines in the system.
As long as n is polynomial in m, we get the same asymptotic performance from both DHTs and skip graphs 
for search operations. 

Skip graphs can support range queries in which one is asked to find a key > x, a key < x, the largest 
key < x, the least key > x, some key in the interval [x, y], all keys in [x, y], and so forth. For most of these 
queries, the procedure is an obvious modification of Algorithm 1 and runs in O(logn) time with O(logn) 
messages. For finding all nodes in an interval, we can use a modified Algorithm 1 to find a single element
of the interval (which takes $O(\log n)$ time and $O(\log n)$ messages). With r nodes in the interval, we can
then broadcast the query through all the nodes (which takes $O(\log r)$ time and $O(r \log n)$ messages). If the
originator of the query is capable of processing r simultaneous responses, the entire operation still takes
$O(\log n)$ time.
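
The interval query can be illustrated on the level 0 list alone. The Python sketch below is a deliberately simplified stand-in: a binary search over a sorted list replaces the skip graph search for the first element of the interval, and a sequential sweep along level 0 replaces the broadcast, so it takes time linear in r rather than the $O(\log r)$ described above. The function name `range_query` is hypothetical.

```python
import bisect

def range_query(sorted_keys, lo, hi):
    """Return all keys k with lo <= k <= hi from the level 0 list."""
    i = bisect.bisect_left(sorted_keys, lo)     # stand-in for the O(log n) search step
    found = []
    while i < len(sorted_keys) and sorted_keys[i] <= hi:
        found.append(sorted_keys[i])            # walk right along level 0 inside the interval
        i += 1
    return found
```

For example, `range_query([13, 21, 33, 48, 75, 99], 20, 50)` returns `[21, 33, 48]`.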

3.2 The insert operation 

A new node u knows some introducing node v in the network that will help it to join the network. Node 
u inserts itself in one linked list at each level till it finds itself in a singleton list at the topmost level. The 
insert operation consists of two stages: 

1. Node u starts a search for itself from v to find its neighbors at level 0, and links to them. 

2. Node u finds the closest nodes s and y at each level $\ell \ge 0$, $s < u < y$, such that $m(u) \upharpoonright (\ell+1) =
m(s) \upharpoonright (\ell+1) = m(y) \upharpoonright (\ell+1)$, if they exist, and links to them at level $\ell + 1$.

Because each existing node v does not require $m(v)_{\ell+1}$ unless there exists another node u such that
$m(v) \upharpoonright (\ell+1) = m(u) \upharpoonright (\ell+1)$, it can delay determining its value until a new node arrives asking for its
value; thus at any given time only a finite prefix of the membership vector of any node needs to be generated.
Detailed pseudocode for the insert operation is given in Algorithm 2. Figure 3 shows a typical execution of
an insert operation in a small skip graph with $\Sigma = \{0, 1\}$, where node u = 36 is inserted starting from node
v = 13.

Inserts can be trickier when we have to deal with concurrent node joins. Before u links to any neighbor, 
it verifies that its join will not violate the order of the nodes. So if any new nodes have joined the skip graph 
between u and its predetermined successor, u will advance over the new nodes if required before linking in 
the correct location. 
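
The way a new node draws its membership vector, and the way existing nodes extend theirs only on demand, can be sketched centrally as follows. This Python sketch is an illustration under simplifying assumptions: it tracks only membership-vector prefixes (not neighbor pointers or messages), it ignores concurrent joins, and `insert` is a hypothetical helper name.

```python
import random

def insert(membership, key, alphabet="01", rng=None):
    """Add `key` to `membership`, drawing symbols until the new node is alone in its list."""
    rng = rng or random.Random()
    if key in membership:
        return membership[key]                  # duplicate key: nothing to insert
    vec = ""
    peers = list(membership)                    # nodes sharing the current prefix of m(key)
    while peers:
        vec += rng.choice(alphabet)             # next character of the new node's vector
        level = len(vec)
        for p in peers:
            # Delayed generation: a peer draws its next character only when asked.
            while len(membership[p]) < level:
                membership[p] += rng.choice(alphabet)
        peers = [p for p in peers if membership[p][:level] == vec]
    membership[key] = vec                       # singleton list reached at level len(vec)
    return vec
```

After a sequence of such inserts, no node's stored prefix is a prefix of another node's, which is the condition for every node to end in a singleton list at its topmost level.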

Lemma 3 The insert operation in a skip graph S with n nodes takes expected $O(\log n)$ messages and $O(\log n)$
time.

Figure 3: Inserting node 36 in a skip graph with $\Sigma = \{0, 1\}$, starting from node 13. Messages are labeled
by numbers in boxes in the order in which they are sent. Messages 1-3 implement node 36 determining the
maximum level of node 13, and starting the search operation to find its neighbor at level 0. Messages 4-5
implement the search operation, and node 33 informing node 36 that it is node 36's closest neighbor at level
0. Messages 6-11 implement node 36 inserting itself between nodes 33 and 48 at level 0. Messages 12-15
implement node 36 determining its neighbors at level 1, and inserting itself between nodes 33 and 48 at level
1. Messages 16-19 implement node 36 determining its neighbors at level 2, and linking to node 33 at level
2. Messages 20-21 implement node 36 determining its neighbors at level 3, finding that no neighbors exist,
and completing its insert operation.

Algorithm 2: insert for new node u

if (introducer = u) then
    u.neighbor[L][0] ← ⊥
    u.neighbor[R][0] ← ⊥
    u.maxLevel ← 0
else
    if (introducer.key < u.key) then
        side ← R
        otherSide ← L
    else
        side ← L
        otherSide ← R
    send (getMaxLevelOp) to introducer
    wait until receipt of (retMaxLevelOp, maxLevel)
    send (searchOp, u, u.key, maxLevel − 1) to introducer
    wait until foundOp or notFoundOp is received
    upon receiving (foundOp, clone):
        terminate insert
    upon receiving (notFoundOp, otherSideNeighbor):
        send (getNeighborOp, side, 0) to otherSideNeighbor
        wait until receipt of (retNeighborOp, sideNeighbor, 0):
        send (getLinkOp, u, side, 0) to otherSideNeighbor
        wait until receipt of (setLinkOp, newNeighbor, 0):
        u.neighbor[otherSide][0] ← newNeighbor
        send (getLinkOp, u, otherSide, 0) to sideNeighbor
        wait until receipt of (setLinkOp, newNeighbor, 0):
        u.neighbor[side][0] ← newNeighbor
    while true do
        m(u)_ℓ ← uniformly chosen random element of Σ
        ℓ ← ℓ + 1
        if (u.neighbor[R][ℓ − 1] ≠ ⊥) then
            send (buddyOp, u, ℓ − 1, m(u)_{ℓ−1}, L) to u.neighbor[R][ℓ − 1]
            wait until receipt of (setLinkOp, neighbor, ℓ):
            u.neighbor[R][ℓ] ← neighbor
        else u.neighbor[R][ℓ] ← ⊥
        if (u.neighbor[L][ℓ − 1] ≠ ⊥) then
            send (buddyOp, u, ℓ − 1, m(u)_{ℓ−1}, R) to u.neighbor[L][ℓ − 1]
            wait until receipt of (setLinkOp, neighbor, ℓ):
            u.neighbor[L][ℓ] ← neighbor
        else u.neighbor[L][ℓ] ← ⊥
        if ((u.neighbor[R][ℓ] = ⊥) and (u.neighbor[L][ℓ] = ⊥)) then
            break
    u.maxLevel ← ℓ

Algorithm 3: Node v's message handler for messages received during the insert of new node u.

upon receiving (getLinkOp, u, side, ℓ):
    change_neighbor(u, side, ℓ)
upon receiving (buddyOp, u, ℓ, val, side):
    if (side = L) then otherSide ← R
    else otherSide ← L
    if (m(v)_ℓ = ⊥) then
        m(v)_ℓ ← uniformly chosen random element of Σ
        v.neighbor[L][ℓ] ← ⊥
        v.neighbor[R][ℓ] ← ⊥
    if (m(v)_ℓ = val) then
        change_neighbor(u, side, ℓ + 1)
    else
        if (v.neighbor[otherSide][ℓ] ≠ ⊥) then
            send (buddyOp, u, ℓ, val, side) to v.neighbor[otherSide][ℓ]
        else
            send (setLinkOp, ⊥, ℓ) to u

Algorithm 4: change_neighbor(u, side, ℓ) for node v

if (side = R) then cmp ← <
else cmp ← >
if ((v.neighbor[side][ℓ]).key cmp u.key) then
    send (getLinkOp, u, side, ℓ) to v.neighbor[side][ℓ]
else
    send (setLinkOp, v, ℓ) to u
    v.neighbor[side][ℓ] ← u



Algorithm 5: Additional messages for node v

upon receiving (updateOp, side, newNeighbor, ℓ):
    v.neighbor[side][ℓ] ← newNeighbor
upon receiving (getMaxLevelOp) from u:
    send (retMaxLevelOp, v.maxLevel) to u
upon receiving (getNeighborOp, side, ℓ) from u:
    send (retNeighborOp, v.neighbor[side][ℓ], ℓ) to u

Proof: Let $\Sigma$ be the alphabet for the membership vectors of the nodes in the skip graph S. With n
nodes, there will be an average of $O\left(\frac{\log n}{\log(1/p)}\right)$ levels in the skip graph, $p = |\Sigma|^{-1}$. To link at level 0, a new
node u performs one search operation. From Lemma 2, this takes $O\left(\frac{\log n}{(1-p)\log(1/p)}\right)$ expected messages
and $O\left(\frac{\log n}{(1-p)\log(1/p)}\right)$ expected time. At each level $\ell$, $\ell \ge 0$, u communicates with an average of $2/p$
nodes, before it finds at most two nodes s and y, with $m(s)_\ell = m(u)_\ell = m(y)_\ell$, $s < u < y$, and connects
to them at level $\ell + 1$. The expected number of messages and time for the insert operation at all levels
is $O\left(\frac{\log n}{p \log(1/p)}\right)$. Thus with fixed p, the insert operation takes expected $O(\log n)$ messages and
$O(\log n)$ time. ∎

With m machines and n resources in the system, most DHTs such as CAN, Pastry and Tapestry take 
$O(\log m)$ time for insertion; an exception is Chord which takes $O(\log^2 m)$ time. An $O(\log m)$ time bound
improves on the $O(\log n)$ bound for skip graphs when m is much smaller than n. However, the cost of this
improvement is losing support for complex queries and spatial locality, and the improvement itself is only a 
constant factor unless some machines store a superpolynomial number of resources. 

3.3 The delete operation 

The delete operation is very simple. When node u wants to leave the network, it informs its predecessor
node at each level to update its successor pointer to point to u's successor. It starts at the topmost level and
works its way down to level 0. Node u also informs its successor node at each level to update its predecessor
pointer to point to u's predecessor. If u's successor or predecessor are being deleted as well, they pass the
message on to their neighbors so that the nodes are correctly linked up. A node does not delete itself from 
the graph as long as it is waiting for some message as a part of the delete operation of another node. 
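
The pointer updates performed by a delete can be sketched on an in-memory neighbor table. This Python sketch assumes a single delete at a time (it does not model the case where neighbors are being deleted concurrently), uses `None` for $\bot$, and the table layout `nbr[key]['L' or 'R'][level]` and the name `delete` are hypothetical choices for illustration.

```python
def delete(nbr, u):
    """Unlink node u from the doubly-linked list at every level, topmost level first."""
    for level in reversed(range(len(nbr[u]["R"]))):
        left, right = nbr[u]["L"][level], nbr[u]["R"][level]
        if left is not None:
            nbr[left]["R"][level] = right       # predecessor now points to u's successor
        if right is not None:
            nbr[right]["L"][level] = left       # successor now points to u's predecessor
    del nbr[u]

# Three nodes linked at level 0 only (assumed example).
nbr = {
    13: {"L": [None], "R": [21]},
    21: {"L": [13], "R": [33]},
    33: {"L": [21], "R": [None]},
}
delete(nbr, 21)
```

After the call, nodes 13 and 33 are adjacent at level 0 and node 21 no longer appears in the table.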



Algorithm 6: delete for existing node u

u.deleteFlag = true
for ℓ ← u.maxLevel downto 0 do
    if u.neighbor[R][ℓ] ≠ ⊥ then
        send (deleteOp, ℓ, sender) to u.neighbor[R][ℓ]
        wait until receipt of (confirmDeleteOp, ℓ) or (noNeighborOp, ℓ):
        upon receiving (noNeighborOp, ℓ):
            if u.neighbor[L][ℓ] ≠ ⊥ then
                send (setNeighborNilOp, ℓ, sender) to u.neighbor[L][ℓ]
                wait until receipt of (confirmDeleteOp, ℓ)



Lemma 4 The delete operation in a skip graph S with n nodes takes expected $O(\log n)$ messages and $O(1)$
time.

Proof: Let $\Sigma$ be the alphabet for the membership vectors of the nodes in the skip graph S. With n
nodes, there will be an average of $O\left(\frac{\log n}{\log(1/p)}\right)$ levels in the skip graph, $p = |\Sigma|^{-1}$. At each level $\ell$, $\ell \ge 0$,
the node to be deleted communicates with at most two other nodes. It takes an average of $O\left(\frac{\log n}{\log(1/p)}\right)$
total messages and $O(1)$ time as the messages at all the levels can be sent in parallel. Thus with fixed p, a
delete operation takes $O(\log n)$ messages and $O(1)$ time. ∎

During the delete operation of node u, if u's successor or predecessor at some level are also being deleted, 
then the number of message at that level is proportional to the number of consecutive nodes being deleted. 

Algorithm 7: Node v's message handler for messages received during the delete operation.

 1 upon receiving ⟨deleteOp, ℓ, sender⟩:
 2     if (v.deleteFlag = true) then
 3         if (v.neighbor[R][ℓ] ≠ ⊥) then
 4             send ⟨deleteOp, ℓ, sender⟩ to v.neighbor[R][ℓ]
 5         else
 6             send ⟨noNeighborOp, ℓ⟩ to sender
 7     else
 8         send ⟨findNeighborOp, ℓ, sender⟩ to v.neighbor[L][ℓ]
 9         wait until receipt of ⟨foundNeighborOp, x, ℓ⟩:
10         v.neighbor[L][ℓ] ← x
11         send ⟨confirmDeleteOp, ℓ⟩ to sender
12 upon receiving ⟨findNeighborOp, ℓ, sender⟩:
13     if (v.deleteFlag = true) then
14         if (v.neighbor[L][ℓ] ≠ ⊥) then
15             send ⟨findNeighborOp, ℓ, sender⟩ to v.neighbor[L][ℓ]
16         else
17             send ⟨foundNeighborOp, ⊥, ℓ⟩ to sender
18     else
19         send ⟨foundNeighborOp, v, ℓ⟩ to sender
20         v.neighbor[R][ℓ] ← sender
21 upon receiving ⟨setNeighborNilOp, ℓ, sender⟩:
22     if (v.deleteFlag = true) then
23         if (v.neighbor[L][ℓ] ≠ ⊥) then
24             send ⟨setNeighborNilOp, ℓ, sender⟩ to v.neighbor[L][ℓ]
25         else
26             send ⟨confirmDeleteOp, ℓ⟩ to sender
27     else
28         send ⟨confirmDeleteOp, ℓ⟩ to sender
29         v.neighbor[R][ℓ] ← ⊥
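
The effect of the delete handlers on a single level can be sketched sequentially. The code below is a single-process stand-in for the message exchange, not the distributed algorithm itself: the `Node` class, `build_list`, `first_live` and `delete_at_level` are hypothetical names, and message passing is replaced by direct pointer walks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    delete_flag: bool = False


def build_list(keys):
    """Build one sorted doubly-linked list (a single level) and index it by key."""
    nodes = [Node(k) for k in sorted(keys)]
    for a, b in zip(nodes, nodes[1:]):
        a.right, b.left = b, a
    return {n.key: n for n in nodes}


def first_live(node, direction):
    """Follow `direction` pointers past nodes that are being deleted."""
    while node is not None and node.delete_flag:
        node = getattr(node, direction)
    return node


def delete_at_level(u):
    """Unlink u at this level, skipping neighbours that are also being deleted."""
    u.delete_flag = True
    z = first_live(u.right, "right")       # deleteOp forwarded to the right
    if z is not None:
        s = first_live(z.left, "left")     # findNeighborOp forwarded to the left
        z.left = s                         # z adopts its new left neighbour (or None)
        if s is not None:
            s.right = z                    # s adopts z as its right neighbour
    else:
        q = first_live(u.left, "left")     # setNeighborNilOp forwarded to the left
        if q is not None:
            q.right = None                 # q becomes the right end of the list
```

For example, with keys 1 to 5 and node 4 already flagged for deletion, deleting node 3 links nodes 2 and 5 directly, which mirrors the case of consecutive deletions discussed above.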

3.4 Correctness of algorithms

In this section, we prove the correctness of the insert and delete algorithms given in Section 3. The definition
of a skip graph in Section 2 involves global properties of the data structure (such as $S_{wa}$ being a subset of
$S_w$) that are difficult to work with in the correctness proofs. So we start by defining a set of local constraints
which characterize a skip graph. We first prove that a data structure is a skip graph if and only if all these
constraints are satisfied, and then we prove that these constraints are not violated after an insert or delete
operation, thus maintaining the skip graph properties. Further, these constraints will be used for our repair
mechanism as we can monitor the state of the graph by checking these constraints locally at each node, and
detecting and repairing node failures.

As explained in Section 3, we use $\bot$ both to refer to the null pointers at the ends of the doubly-linked
lists of the skip graph, and to refer to pointers to failed nodes.

Let x be any node in the skip graph; then for all levels $\ell \ge 0$:

1. If $xR_\ell \ne \bot$, $xR_\ell > x$.

2. If $xL_\ell \ne \bot$, $xL_\ell < x$.

3. If $xR_\ell \ne \bot$, $xR_\ell L_\ell = x$.

4. If $xL_\ell \ne \bot$, $xL_\ell R_\ell = x$.

5. If $m(x) \upharpoonright (\ell+1) = m(xR_\ell^k) \upharpoonright (\ell+1)$ and $\nexists j, j < k, m(x) \upharpoonright (\ell+1) = m(xR_\ell^j) \upharpoonright (\ell+1)$, then $xR_{\ell+1} = xR_\ell^k$.
Else, $xR_{\ell+1} = \bot$.

6. If $m(x) \upharpoonright (\ell+1) = m(xL_\ell^k) \upharpoonright (\ell+1)$ and $\nexists j, j < k, m(x) \upharpoonright (\ell+1) = m(xL_\ell^j) \upharpoonright (\ell+1)$, then $xL_{\ell+1} = xL_\ell^k$.
Else, $xL_{\ell+1} = \bot$.
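
The six constraints above are purely local, so they can be checked node by node. The following sketch builds an idealized skip graph from a table of membership vectors and reports every violated constraint. The representation (`right[l][x]`, `left[l][x]`, with `None` standing for the null pointer) and the names `build_skip_graph` and `violations` are assumptions made for illustration.

```python
from collections import defaultdict


def build_skip_graph(mvec):
    """Ideal skip graph for mvec: key -> membership vector (a string).
    Returns (right, left); right[l][x] is x's level-l successor or None."""
    depth = min(len(v) for v in mvec.values())
    right, left = [], []
    for l in range(depth + 1):
        groups = defaultdict(list)
        for key in sorted(mvec):
            groups[mvec[key][:l]].append(key)   # same length-l prefix, same list
        r = dict.fromkeys(mvec)
        lf = dict.fromkeys(mvec)
        for members in groups.values():
            for a, b in zip(members, members[1:]):
                r[a], lf[b] = b, a
        right.append(r)
        left.append(lf)
    return right, left


def violations(mvec, right, left):
    """List (constraint number, node, level) for every violated constraint."""
    bad = []
    for l in range(len(right)):
        for x in mvec:
            xr, xl = right[l][x], left[l][x]
            if xr is not None and not xr > x:
                bad.append((1, x, l))
            if xl is not None and not xl < x:
                bad.append((2, x, l))
            if xr is not None and left[l][xr] != x:
                bad.append((3, x, l))
            if xl is not None and right[l][xl] != x:
                bad.append((4, x, l))
            if l + 1 == len(right):
                continue                          # no level above the top one
            for number, ptr in ((5, right), (6, left)):
                y, steps = ptr[l][x], 0
                # Nearest node on this side sharing one more membership symbol.
                while (y is not None and steps < len(mvec)
                       and mvec[y][:l + 1] != mvec[x][:l + 1]):
                    y, steps = ptr[l][y], steps + 1
                if ptr[l + 1][x] != y:
                    bad.append((number, x, l))
    return bad
```

A freshly built graph yields an empty list; clearing a single pointer produces both a backpointer violation at the neighbouring node and an inter-level violation at the node itself.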

As per the definition of a skip graph given in Section 2, all the elements in a doubly linked list $S_w$ (which
contains all x for which w is a prefix of m(x) of length $\ell$) are in increasing order. Constraints 1 through 4
imply that all non-edge nodes satisfy the increasing order in the linked lists. Constraints 1 and 2 ensure 
that the order is locally true at every node, whereas Constraints 3 and 4 ensure that the entire list is doubly 
linked correctly. The constraints that the increasing order is satisfied locally at each node and that the list 
is doubly-linked correctly, put together ensure that no element is skipped over and that the entire list is 
sorted. 

Constraints 5 and 6 denote how the lists at different levels are related to each other. The successor
(predecessor) of node x at level $\ell + 1$ is always the first node to its right (left) at level $\ell$ whose membership
vector matches the membership vector of x in one additional position. Node x is connected at level $\ell + 1$,
on the right side to a node z such that $x, z \in S_w$, and z is the nearest node greater than x in $S_w$ with
$m(x)_\ell = m(z)_\ell$. Similarly, x is connected at level $\ell + 1$, on the left side to a node u such that $u, x \in S_w$, and
u is the nearest node less than x in $S_w$ with $m(u)_\ell = m(x)_\ell$.

Define a defective skip graph as a data structure that contains skip graph elements but does not
satisfy the definition of a skip graph; for example, it may contain out-of-order elements, missing links, or 
worse. 

Theorem 5 Every connected component of the data structure is a skip graph if and only if Constraints 1-6
are satisfied.

Proof: We start with the reverse direction: if the constraints are not satisfied, then some connected 
component of the data structure is not a skip graph. As Constraints 2, 4 and 6 are mirror images of 
Constraints 1, 3 and 5 respectively, we will only consider violations of Constraints 1, 3 and 5. 

Figure 4 shows how Constraints 1 and 3 can be violated. Each violation leads to either an unsorted or
inconsistently linked list at level $\ell$, so the data structure is not a skip graph. There are two ways in which
Constraint 5 can be violated.

1. For some x and $\ell$, $xR_{\ell+1} = xR_\ell^k$ but $\exists j, j < k, m(x) \upharpoonright (\ell+1) = m(xR_\ell^j) \upharpoonright (\ell+1)$.

Let $y = xR_\ell^j$ and $z = xR_\ell^k$. As the linked list is sorted at level $\ell$, $j < k \Rightarrow y < z$, and since
$y = xR_\ell^j$, $x < y$. Let $S_w = \{x \mid m(x) \upharpoonright (\ell+1) = w\}$. Then in a skip graph, $x, y$ and $z \in S_w$. Since
$y \ne xR_{\ell+1}$, either $y \notin S_w$ or the linked list at level $\ell + 1$ is not sorted as $x < y < z$. In both cases, the
resulting data structure is a defective skip graph.

2. For some x and $\ell$, $xR_{\ell+1} \ne xR_\ell^k$.

As $m(x) \upharpoonright (\ell+1) = m(xR_{\ell+1}) \upharpoonright (\ell+1)$, $m(x) \upharpoonright \ell = m(xR_{\ell+1}) \upharpoonright \ell$. It follows that both x and $xR_{\ell+1}$
are in $S_{m(x) \upharpoonright \ell}$. But then if $xR_{\ell+1} \ne xR_\ell^k$ for any k, some edge in $S_{m(x) \upharpoonright \ell}$ is missing, and the data
structure is a defective skip graph.

Figure 4: Violation of Constraints 1 and 3. (a) Violation of Constraint 1: $xR_\ell > x$. (b) Violation of Constraint 3: $xR_\ell L_\ell = x$.



Now we prove that every connected component of a data structure is a skip graph if all the constraints are 
satisfied. We first prove that we have sorted, doubly-linked lists at all levels using Constraints 1 through 4, 
and then prove that each list contains the correct elements as per their membership vectors using Con- 
straints 5 and 6. 

Let x be an arbitrary element of the data structure. Let $S_{x,\ell}$ be the maximal sequence of the form
$xL_\ell^j, \ldots, x, \ldots, xR_\ell^k$, where $j, k \ge 0$, such that no element of the sequence is $\bot$. We show that this sequence is
sorted and doubly-linked using induction. According to Constraint 1, $xR_\ell > x$, and according to Constraint 3,
$xR_\ell L_\ell = x$. Thus the sequence $x, xR_\ell$ is sorted and doubly-linked. Similarly, according to Constraint 2,
$xL_\ell < x$, and according to Constraint 4, $xL_\ell R_\ell = x$. Thus the sequence $xL_\ell, x$ is sorted and doubly-linked.
Let the sequence $xL_\ell^{j-1}, \ldots, x, \ldots, xR_\ell^{k-1}$ be a sorted, doubly-linked list. According to Constraints 1 and 3,
$xR_\ell^k > xR_\ell^{k-1}$, and $xR_\ell^{k-1} R_\ell L_\ell = xR_\ell^{k-1}$. Similarly, according to Constraints 2 and 4, $xL_\ell^j < xL_\ell^{j-1}$, and
$xL_\ell^{j-1} L_\ell R_\ell = xL_\ell^{j-1}$. Thus the maximal sequence $S_{x,\ell} = xL_\ell^j, \ldots, x, \ldots, xR_\ell^k$ is sorted and doubly-linked.

We now show that if two nodes are connected at level $\ell$, they are also connected at level 0. Suppose that
Constraints 5 and 6 hold. Then, we prove that for each level $\ell \ge 0$, $xR_\ell = xR_0^j$ and $xL_\ell = xL_0^{j'}$ for some $j, j'$. Clearly
this is true for $\ell = 0$ and $j = j' = 1$. Suppose that it is true for level $\ell - 1$. Let $x = y_0$ and let each $xR_{\ell-1}^i = y_i$,
$1 \le i \le k$. For each i, $y_i = y_{i-1} R_{\ell-1} = y_{i-1} R_0^{j_i}$. So $y_1 = y_0 R_{\ell-1} = y_0 R_0^{j_1}$, $y_2 = y_1 R_{\ell-1} = y_1 R_0^{j_2} = y_0 R_0^{j_1 + j_2}$,
and so on. Thus, $y_k = y_0 R_0^{j_1 + j_2 + \cdots + j_k} = y_0 R_0^j$, where $j = j_1 + \cdots + j_k$. But $y_k = xR_{\ell-1}^k$ and $y_0 = x$.
So $xR_{\ell-1}^k = xR_0^j$. According to Constraint 5, $xR_\ell = xR_{\ell-1}^k$. Thus we get $xR_\ell = xR_0^j$. A similar proof will
show that $xL_\ell = xL_0^{j'}$.

We use the proof above to show that any two connected nodes are connected in the same list at level
0. Consider a path $xE_1 E_2 \ldots E_k y$ where each $E_i$ is either $L_{\ell_i}$ or $R_{\ell_i}$. As proved above, there exists a path
$x{E'_1}^{j_1} {E'_2}^{j_2} \cdots {E'_k}^{j_k} y$ where each $E'_i$ is either $L_0$ or $R_0$. Thus it follows that x and y are in the same list at
level 0. Also $S_{x,0} = S_{y,0}$ if x and y are in the same connected component. So we get a single list $S_\epsilon$ at level
0, which consists of all the elements in the same connected component of the data structure.

As proved above, $S_\epsilon$ is also sorted and doubly-linked. With the single list $S_\epsilon$ at level 0, according to
Constraints 5 and 6, each node $x \in S_\epsilon$ is linked to its right and left at level 1 to the nearest nodes z and
u respectively (if they exist), such that $m(x) \upharpoonright 1 = m(z) \upharpoonright 1 = m(u) \upharpoonright 1$, and $u < x < z$. Thus, we get
$|\Sigma|$ linked lists at level 1, $S_a = \{y \mid m(y)_0 = a\}$, one for each $a \in \Sigma$. In general, at level $\ell$, we can get up to
$|\Sigma|^\ell$ lists, one for each $w \in \Sigma^\ell$. Each list contains all the nodes which have the matching membership vector
prefix. As proved above, each of these lists is also sorted and doubly-linked. Thus, if the data structure
satisfies all the constraints, it is a skip graph. ∎

Lemma 6 Inserting a new node u in a skip graph S using Algorithm 2 gives a skip graph. 

Proof: Inserting a new node u in S consists of two stages: inserting u in level 0 using a search
operation, and inserting u in levels $\ell > 0$ using the neighbors of u at level $\ell - 1$. We consider the case
where the introducing node's key is less than u's key; the other case is similar so we omit those details here.
Also, we only prove that Constraints 1, 3 and 5 are satisfied as Constraints 2, 4 and 6 are mirror images of 
Constraints 1, 3 and 5 respectively. 

The search operation started by u (line 14 of Algorithm 2) returns the largest node s less than u (line 17 of
Algorithm 1), and u sends a getLinkOp message to s (line 21 of Algorithm 2). One of the following two cases
occurs: Either s sets $sR'_0 = u > s$ (line 7 of Algorithm 4), maintaining Constraint 1. Or, if additional nodes
are inserted between u and s, and $sR_0 < u$ (line 3 of Algorithm 4), then s passes the getLinkOp message
to $sR_0$. As the getLinkOp message is only passed to nodes whose key is less than that of u, eventually it
reaches some node t where the message terminates and $tR'_0 = u > t$, maintaining Constraint 1. Also as
u sets $uL'_0 = s < u$ or $uL'_0 = t < u$ (line 23 of Algorithm 2), either $sR'_0 L'_0 = s$ or $tR'_0 L'_0 = t$, satisfying
Constraint 3. In the absence of a suitable s, no pointers are changed and Constraints 1 and 3 are satisfied
as the pointer values remain unchanged from before the insert.

Node u also determines the initial right neighbor of s, say z (lines 19 and 20 of Algorithm 2) and sends
a getLinkOp message to z (line 24 of Algorithm 2). Similar to the earlier getLinkOp message, either z
sets $zL'_0 = u < z$ (line 7 of Algorithm 4), or passes the message on to $zL_0$ if it is greater than u (line 3
of Algorithm 4). As the getLinkOp message is only passed to nodes whose key is greater than that of
u, eventually it reaches some node y where the message terminates and $yL'_0 = u < y$. Node u sets
$uR'_0 = z > u$ or $uR'_0 = y > u$ (line 26 of Algorithm 2), maintaining Constraint 1. As z sets $zL'_0 = u < z$ or
y sets $yL'_0 = u < y$ (line 7 of Algorithm 4), $uR'_0 L'_0 = u$, satisfying Constraint 3. In the absence of a successor,
u simply sets $uR'_0 = \bot$ (line 2 or 26 of Algorithm 2), thus trivially satisfying Constraints 1 and 3.

Node u uses its neighbors at level $\ell$ ($\ell \ge 0$) to find its neighbors at level $\ell + 1$. Node u sends a buddyOp
message to $uR_\ell > u$ (line 32 of Algorithm 2), and this message is passed to the right to successive nodes
$uR_\ell^k > u$, $k > 1$, until it reaches a node y such that $m(u) \upharpoonright (\ell+1) = m(y) \upharpoonright (\ell+1)$, and u sets $uR'_{\ell+1} = y > u$
(line 34 of Algorithm 2), satisfying Constraint 1. As this message is only sent to the right, u can only connect
on its right to nodes greater than itself. As y sets $yL'_{\ell+1} = u < y$, $uR'_{\ell+1} L'_{\ell+1} = u$, satisfying Constraint 3.
Similarly, u also sends a buddyOp message to its left to $uL_\ell < u$ (line 37 of Algorithm 2); as this message is
only sent to nodes s less than u, it ensures that $sR'_{\ell+1} = u > s$, satisfying Constraint 1. As u sets $uL'_{\ell+1} = s$,
$sR'_{\ell+1} L'_{\ell+1} = s$, satisfying Constraint 3.

As u only queries the nodes z in the same list as itself at level $\ell$, it is ensured that $m(u) \upharpoonright \ell = m(z) \upharpoonright \ell$.
Further, we see that u only links to a node z such that $m(u)_\ell = m(z)_\ell$ (line 10 of Algorithm 3). Thus u can
only link to z at level $\ell + 1$ if $m(u) \upharpoonright (\ell+1) = m(z) \upharpoonright (\ell+1)$. Having seen that u only links to nodes with
the correct membership vector prefix, it only remains to show that u links to the nearest such nodes at each
level $\ell > 0$. We see that u starts looking for its successor at level $\ell + 1$ from $uR_\ell$ (line 31 of Algorithm 2).
Node $uR_\ell$ is either the successor for node u at level $\ell + 1$ (line 10 of Algorithm 3), or it passes the message to
its successor (line 14 of Algorithm 3). As the search for $uR_{\ell+1}$ proceeds one node at a time along $S_{m(u) \upharpoonright \ell}$,
it is guaranteed to find the nearest node greater than u whose membership vector matches $m(u)_\ell$. Thus
$uR_{\ell+1} = uR_\ell^k$, for the smallest $k > 0$, satisfying Constraint 5.
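
The level-by-level argument above can be exercised with a sequential sketch of insertion. This is not Algorithm 2: the level-0 search is replaced by a linear scan, there is no concurrency, and the graph layout (`g["R"][l][x]`, `g["L"][l][x]`) and the function names are hypothetical. It only illustrates how the walk along level $\ell$ finds the nearest neighbours for level $\ell + 1$.

```python
def new_graph(levels):
    """Empty skip graph with levels 0 .. levels-1."""
    return {"mvec": {}, "R": [{} for _ in range(levels)],
            "L": [{} for _ in range(levels)]}


def insert(g, key, vec):
    """Link `key` with membership vector `vec` into every level of g."""
    mvec, R, L = g["mvec"], g["R"], g["L"]
    # Stand-in for the search operation: nearest existing keys at level 0.
    smaller = [k for k in mvec if k < key]
    larger = [k for k in mvec if k > key]
    pred = max(smaller) if smaller else None
    succ = min(larger) if larger else None
    mvec[key] = vec
    for l in range(len(R)):
        L[l][key], R[l][key] = pred, succ
        if pred is not None:
            R[l][pred] = key
        if succ is not None:
            L[l][succ] = key
        # Walk along level l, starting at the level-l neighbours, to the
        # nearest nodes that agree with `vec` on one more symbol.
        while succ is not None and mvec[succ][:l + 1] != vec[:l + 1]:
            succ = R[l][succ]
        while pred is not None and mvec[pred][:l + 1] != vec[:l + 1]:
            pred = L[l][pred]


def level_lists(g, l):
    """Return {prefix: keys in list order} for every list at level l."""
    out = {}
    for x in sorted(g["mvec"]):
        if g["L"][l][x] is None:            # x is the head of its list
            chain, y = [], x
            while y is not None:
                chain.append(y)
                y = g["R"][l][y]
            out[g["mvec"][x][:l]] = chain
    return out
```

Inserting keys in any order should leave each level-$\ell$ list sorted and containing exactly the keys that share a length-$\ell$ prefix, which is what Constraints 1 through 6 require.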

We note that with concurrent inserts, additional nodes may get linked at some level between u and its 
predetermined neighbors, found using either the search operation (for level 0) or the buddyOp messages (for 
levels greater than 0). In each case, we see that when some old node receives a getLinkOp message to link
to u, it verifies that pointing to u will maintain the skip graph node order. Otherwise, it passes the message 
to its appropriate neighbor (line 4 of Algorithm 4). This is explained in detail above, and it ensures that u 
links to the correct nodes at each level. So the constraints are maintained even with concurrent inserts in 
the skip graph. 

Thus when all the concurrent insert operations are completed, we get a skip graph. ∎

Lemma 7 Deleting node u from a skip graph S using Algorithm 6 gives a skip graph. 

Proof: Deleting a node u from a skip graph S consists of two stages at each level $\ell$: finding a node to
the right of u that is not being deleted, and then finding a node to the left of u that is not being deleted to
link these two nodes together.

Node u sends a deleteOp message to its successor $uR_\ell$ (line 4 of Algorithm 6). As long as this message is
received by a node that is itself being deleted (line 2 of Algorithm 7), it is passed on to the right to successive
nodes $uR_\ell^2, uR_\ell^3, \ldots$ (line 4 of Algorithm 7), until it reaches some node $z > u$ which is not being deleted.
Node z sends a findNeighborOp message to $zL_\ell$ to determine its new left neighbor (line 8 of Algorithm 7).
As long as this message is received by a node that is itself being deleted (line 13 of Algorithm 7), it is passed
on to the left to successive nodes $zL_\ell^2, zL_\ell^3, \ldots$ (line 15 of Algorithm 7), until it reaches some node $s < u < z$
which is not being deleted. Node s sends a foundNeighborOp message back to z and sets $sR'_\ell = z > s$,
satisfying Constraint 1 (lines 19 and 20 of Algorithm 7). Upon receipt of this message (line 9 of Algorithm 7),
z sets $zL'_\ell = s < z$ (line 10 of Algorithm 7), satisfying Constraint 2. Further, as $zL'_\ell R'_\ell = z$ and $sR'_\ell L'_\ell = s$,
Constraints 3 and 4 are also satisfied. Nodes z and s are in the same list at level $\ell$, so $m(z) \upharpoonright \ell = m(s) \upharpoonright \ell$,
and they are also in the same list at level $\ell - 1$. As all the nodes between them will be eventually deleted,
they are the nearest nodes with matching membership vectors to be linked at level $\ell$, satisfying Constraints 5
and 6.

If no suitable node s exists, the last node on the left of z that is being deleted sends a foundNeighborOp
to z (line 17 of Algorithm 7). Node z sets $zL'_\ell = \bot$ (line 10 of Algorithm 7), thus trivially satisfying
Constraints 2, 4 and 6. If no suitable node z exists, the last node to the right of u that is being deleted
informs u of that (line 6 of Algorithm 7). Node u sends a setNeighborNilOp message to $uL_\ell$ (line 8 of
Algorithm 6), which is passed to the left until it reaches a node q that is not being deleted (line 24 of
Algorithm 7). Node q sets $qR'_\ell = \bot$ (line 29 of Algorithm 7), once again trivially satisfying Constraints 1, 3
and 5. If no suitable node q exists, then no link changes are made at all, and all the constraints are satisfied
as before.

We note that with concurrent deletes, additional nodes may get deleted at some level between u and its 
existing neighbors. We see that when some neighboring node receives a message for the delete operation, it 
verifies that it is not being deleted itself. If so, it passes the message on to its neighbor as explained in detail 
above. This ensures that only nodes on either side of u that are not being deleted link to each other. 

Thus, after all the concurrent delete operations have been completed, we get a skip graph. ∎

4 Fault tolerance 

In this section, we describe some of the fault tolerance properties of a skip graph with alphabet {0, 1}. Fault 
tolerance of related data structures, such as augmented versions of linked lists and binary trees, has been 
well-studied and some results can be seen in [MP84, AB96]. In Section 5, we give a repair mechanism that 
detects node failures and initiates actions to repair these failures. Before we explain the repair mechanism, 
we are interested in the number of nodes that can be separated from the primary component by the failure 
of other nodes, as this determines the size of the surviving skip graph after the repair mechanism finishes. 

Note that if multiple nodes are stored on a single machine, when that machine crashes, all of its nodes 
vanish simultaneously. Our results are stated in terms of the fraction of nodes that are lost; if the nodes are 
roughly balanced across machines, this will be proportional to the fraction of machine failures. Nonetheless, 
it would be useful to have a better understanding of fault tolerance when the mapping of resources to 
machines is taken into account; this may in fact dramatically improve fault tolerance, as nodes stored on 
surviving machines can always find other nodes stored on the same machine, and so need not be lost even if 
all of their neighbors in the skip graph are lost. 

We consider two fault models: a random failure model in which an adversary chooses random nodes to
fail, and a worst-case failure model in which an adversary chooses specific nodes to fail after observing the
structure of the skip graph. For a random failure pattern, experimental results, presented in Section 4.1,
show that for a reasonably large skip graph nearly all nodes remain in the primary component until about
two-thirds of the nodes fail, and that it is possible to make searches highly resilient to failures even without
using the repair mechanism, by the use of redundant links. For a worst-case failure pattern, theoretical
results, presented in Section 4.2, show that even a worst-case choice of failures causes limited damage. With
high probability, a skip graph with n nodes has an $\Omega(1/\log n)$ expansion ratio, implying that at most $O(f \cdot \log n)$
nodes can be separated from the primary component by $f$ failures. We do not give experimental results for
adversarial failures as experiments may not be able to identify the worst-case failure pattern.



4.1 Random failures 

In our simulations, skip graphs appear to be highly resilient to random failures. We constructed a skip graph
of 131072 nodes, where each node had a unique label from [1, 131072]. We progressively increased the
probability of node failure and measured the size of the largest connected component of the live nodes as well as
the number of isolated nodes as a fraction of the total number of nodes in the graph. As shown in Figure 5,
nearly all nodes remain in the primary component even as the probability of individual node failure exceeds
0.6. We also see that many nodes become isolated as the failure probability increases because all of their
immediate neighbors die.
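
A small-scale version of this kind of experiment can be sketched as follows. It is an illustrative simulation under stated assumptions, not the code behind Figure 5: the function names are hypothetical, membership bits are drawn lazily while the lists are split, and both quantities are reported as fractions of the surviving nodes.

```python
import random
from collections import defaultdict


def skip_graph_edges(n, rng):
    """Edges of a skip graph on keys 0..n-1 with random binary membership vectors."""
    edges = set()
    lists = [list(range(n))]                 # level 0: a single sorted list
    while lists:
        next_lists = []
        for members in lists:
            edges.update(zip(members, members[1:]))
            if len(members) > 1:
                halves = ([], [])
                for x in members:            # next membership bit, chosen at random
                    halves[rng.randrange(2)].append(x)
                next_lists.extend(h for h in halves if h)
        lists = next_lists
    return edges


def survival_stats(edges, alive):
    """Return (largest component, isolated nodes) as fractions of live nodes."""
    if not alive:
        return 0.0, 0.0
    adj = defaultdict(list)
    for a, b in edges:
        if a in alive and b in alive:
            adj[a].append(b)
            adj[b].append(a)
    isolated = sum(1 for x in alive if x not in adj)
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                         # depth-first search over live nodes
            x = stack.pop()
            size += 1
            for y in adj.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        best = max(best, size)
    return best / len(alive), isolated / len(alive)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4096
    edges = skip_graph_edges(n, rng)
    for p in (0.0, 0.2, 0.4, 0.6, 0.8):
        alive = {x for x in range(n) if rng.random() >= p}
        print(p, survival_stats(edges, alive))
```

The graph size and failure probabilities in the demonstration loop are arbitrary choices; reproducing the reported curves would require the full 131072-node graph.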

Figure 5: The number of isolated nodes and the size of the primary component as a fraction of the surviving
nodes in a skip graph with 131072 nodes.

Figure 6: Fraction of failed searches in a skip graph with 131072 nodes and 10000 messages. Each node has
up to five successors at each level.


For searches, the fact that the average search involves only $O(\log n)$ nodes establishes trivially that most
searches succeed as long as the proportion of failed nodes is substantially less than $O(1/\log n)$. By detecting
failures locally and using additional redundant edges, we can make searches highly tolerant to small numbers
of random faults.

Some further experimental results are shown in Figure 6. In these experiments, each node had additional 
links to up to five nearest successors at every level. A total of 10000 messages were sent between randomly 
chosen source and destination nodes, and the fraction of failed searches was measured. We see that skip 
graphs are quite resilient to random failures. This plot appears to contradict the one shown in Figure 5, 
because we would expect all the searches to succeed as long as all live nodes are in the same connected 
component. However, once the source and target nodes are fixed, there is a fixed, deterministic path along 
which the search proceeds and if any node on this path fails, the search fails. So there may be some path 
between the source and the destination nodes, putting them in the same connected component, but the path 
used by the search algorithm may be broken, foiling the search. This suggests that if we use smarter search 
techniques, such as jumping between the different skip lists that a node belongs to, we can get much better 
search performance even in the presence of failures. 

In general, skip graphs do not provide as strong guarantees as those provided by data structures based on 
explicit use of expanders such as censorship-resistant networks [FS02, SFG+02, Dat02]. But we believe that
this is compensated for by the simplicity of skip graphs and the existence of good distributed mechanisms 
for constructing and repairing them. 

4.2 Adversarial failures 

In addition to considering random failures, we are also interested in analyzing the performance of a skip 
graph when an adversary can observe the data structure, and choose specific nodes to fail. Experimental 
results may not even be able to identify these worst-case failure patterns. So in this section, we look at the 
expansion ratio of a skip graph, as that gives us the number of nodes that can be separated from the primary 
component even with adversarial failures. 

Let G be a graph. Recall that the expansion ratio of a set of nodes A in G is $|\delta A| / |A|$, where $|\delta A|$ is the
number of nodes that are not in A but are adjacent to some node in A. The expansion ratio of the graph
G is the minimum expansion ratio for any set A, for which $1 \le |A| \le n/2$. The expansion ratio determines
the resilience of a graph in the presence of adversarial failures, because separating a set A from the primary
component requires all nodes in $\delta A$ to fail. We will show that skip graphs have $\Omega(1/\log n)$ expansion ratio with
high probability, implying that only $O(f \cdot \log n)$ nodes can be separated by $f$ failures, even if the failures are
carefully targeted.
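
For very small graphs the definition can be evaluated directly by enumerating every candidate set A. The brute-force sketch below is exponential in the number of nodes and is meant only to make the definition concrete; the function name `expansion_ratio` is an assumption.

```python
from itertools import combinations


def expansion_ratio(nodes, edges):
    """Minimum of |boundary(A)| / |A| over all A with 1 <= |A| <= n/2."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    best = float("inf")
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            A = set(subset)
            # Nodes outside A that are adjacent to some node in A.
            boundary = set().union(*(adj[v] for v in A)) - A
            best = min(best, len(boundary) / len(A))
    return best
```

On a path of four nodes the ratio is 0.5 (cut the path in the middle), while on a four-node cycle it is 1.0. Since cutting off A requires every node of its boundary to fail, a ratio of at least $\rho$ means that $f$ failures can separate at most $f / \rho$ nodes.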

Our strategy for showing a lower bound on the expansion ratio of a skip graph will be to show that with
high probability, all sets A either have large $\delta_0 A$ (i.e., many neighbors at the bottom level of the skip graph)
or have large $\delta_\ell A$ for some particular $\ell$ chosen based on the size of A. Formally, we define $\delta_\ell A$ as the set of
all nodes that are not in A but are joined to a node in A by an edge at level $\ell$. Our result is based on the
observation that $\delta A = \bigcup_\ell \delta_\ell A$ and $|\delta A| \ge \max_\ell |\delta_\ell A|$. We begin by counting the number of sets A of a given
size that have small $\delta_0 A$.

Lemma 8 In an n-node skip graph with alphabet {0, 1}, the number of sets A, where $|A| = m < n$ and
$|\delta_0 A| < s$, is less than $\sum_{r=1}^{s-1} \binom{m+1}{r} \binom{n-m-1}{r-1}$.

Proof: Without loss of generality, assume that the nodes of the skip graph are numbered from 1 to n.
Given a subset A of these nodes, define a corresponding bit-vector x by letting $x_i = 1$ if and only if node i
is in A. Then $\delta_0 A$ corresponds to all zeroes in x that are adjacent to a one.

Consider the extended bit-vector $x' = 1x1$ obtained by appending a one to each end of x. Because $x'$
starts and ends with a one, it can be divided into alternating intervals of ones and zeroes, of which $r + 1$
intervals will consist of ones and r will consist of zeroes for some r, where $r > 0$ since x contains at least one
zero. Observe that each interval of zeroes contributes at least one and at most two of its endpoints to $\delta_0 A$.
It follows that $r \le |\delta_0 A| \le 2r$, and thus any A for which $|\delta_0 A| < s$ corresponds to an x for which $x'$ contains
$r \le |\delta_0 A| < s$ intervals of zeroes.

Since there is at least one A with $r < s$ but $|\delta_0 A| \ge s$, the number of sets A with $|\delta_0 A| < s$ is strictly less
than the number of sets A with $r < s$. By counting the latter quantity, we get a strict upper bound on the
former.

We now count, for each r, the number of bit-vectors $x'$ with $n - m$ zeroes consisting of $r + 1$ intervals of
ones and r intervals of zeroes. Observe that we can characterize such a bit-vector completely by specifying
the nonzero length of each of the $r + 1$ all-one intervals together with the nonzero length of each of the r
all-zero intervals. There are $m + 2$ ones that must be distributed among the $r + 1$ all-one intervals, and there
are $\binom{m+2-1}{r+1-1} = \binom{m+1}{r}$ ways to do so. Similarly, there are $n - m$ zeroes to distribute among the r all-zero
intervals, and there are $\binom{n-m-1}{r-1}$ ways to do so. Since these two distributions are independent, the total
count is exactly $\binom{m+1}{r} \binom{n-m-1}{r-1}$.

Summing over all $r < s$ then gives the upper bound $\sum_{r=1}^{s-1} \binom{m+1}{r} \binom{n-m-1}{r-1}$. ∎
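
The counting argument can be sanity-checked by brute force for small n. The sketch below enumerates all subsets A of size m, confirms $r \le |\delta_0 A| \le 2r$ and the exact per-r count, and compares the number of sets with small $\delta_0 A$ against the sum. The function names are hypothetical, and the final comparison is left to the caller so that it can be made in the non-strict form, which is the safe one for very small parameters.

```python
from itertools import combinations
from math import comb


def zero_runs_and_boundary(n, A):
    """Return (r, |delta_0 A|) for a subset A of the nodes 1..n at level 0."""
    x = [1 if i in A else 0 for i in range(1, n + 1)]
    # r: number of maximal intervals of zeroes in x.
    r = sum(1 for i in range(n) if x[i] == 0 and (i == 0 or x[i - 1] == 1))
    # Zeroes adjacent to a one, i.e. nodes outside A next to a node of A.
    boundary = sum(
        1 for i in range(n)
        if x[i] == 0 and ((i > 0 and x[i - 1] == 1) or (i + 1 < n and x[i + 1] == 1))
    )
    return r, boundary


def check_lemma(n, m, s):
    """Return (number of A with |delta_0 A| < s, the sum from the lemma)."""
    by_r, small = {}, 0
    for subset in combinations(range(1, n + 1), m):
        r, boundary = zero_runs_and_boundary(n, set(subset))
        assert r <= boundary <= 2 * r
        by_r[r] = by_r.get(r, 0) + 1
        small += boundary < s
    for r, count in by_r.items():
        # Exact count of bit-vectors with r intervals of zeroes.
        assert count == comb(m + 1, r) * comb(n - m - 1, r - 1)
    bound = sum(comb(m + 1, r) * comb(n - m - 1, r - 1) for r in range(1, s))
    return small, bound
```

For instance, `check_lemma(8, 3, 3)` runs the internal checks over all 56 subsets of size 3 and returns the two quantities being compared.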

For levels $\ell > 0$, we show with a probabilistic argument that $|\delta_\ell A|$ is only rarely small.

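Lemma 9 Let A be a subset of $m \le n/2$ nodes of an n-node skip graph S with alphabet {0, 1}. Then for any
$\ell$, $\Pr\left[|\delta_\ell A| < \frac{1}{3} \cdot 2^\ell\right] < 2 \binom{2^\ell}{\lfloor \frac{2}{3} \cdot 2^\ell \rfloor} (2/3)^m$.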

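Proof: The key observation is that for each b in $\{0, 1\}^\ell$, if A contains a node u with $m(u) \upharpoonright \ell = b$ and
A's complement $S - A$ contains a node v with $m(v) \upharpoonright \ell = b$, then there exist nodes $u' \in A$ and $v' \in S - A$
along the path from u to v in $S_b$, such that $u'$ and $v'$ are adjacent in $S_b$. Furthermore, since such pairs are
distinct for distinct b, we get a lower bound on $\delta_\ell A$ by computing a lower bound on the number of distinct
b for which A and $S - A$ both contain at least one node in $S_b$.

Let $T(A)$ be the set of $b \in \{0, 1\}^\ell$ for which A contains a node of $S_b$, and similarly for $T(S - A)$. Then

$$\Pr\left[|T(A)| < \tfrac{2}{3} \cdot 2^\ell\right] \le \sum_{B \subset \{0,1\}^\ell,\ |B| = \lfloor \frac{2}{3} \cdot 2^\ell \rfloor} \Pr[T(A) \subseteq B] \le \binom{2^\ell}{\lfloor \frac{2}{3} \cdot 2^\ell \rfloor} (2/3)^{|A|},$$

and by the same reasoning,

$$\Pr\left[|T(S - A)| < \tfrac{2}{3} \cdot 2^\ell\right] \le \binom{2^\ell}{\lfloor \frac{2}{3} \cdot 2^\ell \rfloor} (2/3)^{|S - A|}.$$

But if both $T(A)$ and $T(S - A)$ hit at least two-thirds of the b, then their intersection must hit at least
one-third, and thus the probability that $|T(A) \cap T(S - A)| < \frac{1}{3} \cdot 2^\ell$ is at most $\binom{2^\ell}{\lfloor \frac{2}{3} \cdot 2^\ell \rfloor} \left((2/3)^{|A|} + (2/3)^{|S - A|}\right)$,
which is in turn bounded by $2 \binom{2^\ell}{\lfloor \frac{2}{3} \cdot 2^\ell \rfloor} (2/3)^{|A|}$ under the assumption that $|A| \le |S - A|$. ∎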

We can now get the full result by arguing that there are not enough sets A with small $|\delta_0 A|$ (Lemma 8) to
get a non-negligible probability that any one of them has small $|\delta_\ell A|$ for an appropriately chosen $\ell$ (Lemma 9).
Details are given in the proof of Theorem 10 below.

Theorem 10 Let $c > 6$. Then a skip graph with n nodes and alphabet {0, 1} has an expansion ratio of at
least $\frac{1}{c \log_{3/2} n}$ with probability at least $1 - \alpha n^{5-c}$, where the constant factor $\alpha$ does not depend on c.



Proof: We will show that the probability that a skip graph S with n nodes does not have the given
expansion ratio is at most $\alpha n^{5-c}$, where $\alpha = 31$. The particular value of $\alpha = 31$ may be an artifact of our
proof; the actual constant may be smaller.

Consider some subset A of S with $|A| = m \le n/2$. Let $s = \frac{m}{c \log_{3/2} n} = \frac{m \lg(3/2)}{c \lg n}$, and let $s_1 = \lceil s \rceil$. We
wish to show that, with high probability, all A of size m have $|\delta A| \ge s_1$. We will do so by counting the
expected number of sets A with smaller expansion. Any such set must have both $|\delta_0 A| < s$ and $|\delta_\ell A| < s$
for any $\ell$; our strategy will be to show first that there are few sets A in the first category, and that each set
that does have small $\delta_0 A$ is very likely to have large $\delta_\ell A$ for a suitable $\ell$.

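By Lemma 8, there are at most $\sum_{r=1}^{s_1 - 1} \binom{m+1}{r} \binom{n-m-1}{r-1}$ sets A of size m for which $|\delta_0 A| < s_1$. It is not
hard to show that the largest term in this sum dominates. Indeed, for $r < s_1$, the ratio between adjacent
terms $\binom{m+1}{r} \binom{n-m-1}{r-1} \big/ \binom{m+1}{r+1} \binom{n-m-1}{r}$ equals $\frac{r+1}{m+1-r} \cdot \frac{r}{n-m-r}$, and since r is at most a small fraction of m and $m \le \frac{n}{2}$, this product is
easily seen to be less than $\frac{1}{2}$. It follows that

$$\sum_{r=1}^{s_1 - 1} \binom{m+1}{r} \binom{n-m-1}{r-1} < \binom{m+1}{s_1 - 1} \binom{n-m-1}{s_1 - 2} \sum_{i=0}^{\infty} 2^{-i} = 2 \binom{m+1}{s_1 - 1} \binom{n-m-1}{s_1 - 2} < n^{2(s_1 - 1)} < n^{2s}.$$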



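Now let $\ell = \lceil \lg s + \lg 3 \rceil$, so that $s \le \frac{1}{3} \cdot 2^\ell < 2s$.
Applying Lemma 9, we have

$$\Pr[|\delta_\ell A| < s] \le \Pr\left[|\delta_\ell A| < \tfrac{1}{3} \cdot 2^\ell\right] < 2 \binom{2^\ell}{\lfloor \frac{2}{3} \cdot 2^\ell \rfloor} (2/3)^m < 2 \binom{6s}{2s+1} (2/3)^m < 2 \, (6s)^{2s+1} (2/3)^m,$$

and thus

$$\sum_{A \subset S,\ |A| = m,\ |\delta_0 A| < s} \Pr[|\delta_\ell A| < s] < n^{2s} \cdot 2 \cdot (6s)^{2s+1} (2/3)^m. \qquad (1)$$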



Taking the base-2 logarithm of the right-hand side gives

$$\frac{2m \lg(3/2)}{c} + 1 + (2s+1) \lg(6s) - m \lg(3/2) < \frac{2m \lg(3/2)}{c} + 1 + 3 \left(\frac{m \lg(3/2)}{c \lg n}\right) \lg n - m \lg(3/2) = 1 + m \left(\frac{5}{c} - 1\right) \lg(3/2).$$

It follows that the probability that there exists an A of size m, for which both $|\delta_0 A|$ and $|\delta_\ell A|$ are less than
s, is at most 2 to the above quantity, which we can write as $2 b^m$ where $b = (3/2)^{5/c - 1}$.

To compute the probability that any set has a small neighborhood, we sum over m. By definition of the
expansion ratio, we need only consider values of m less than or equal to n/2; however, because every proper
subset A of S has at least one neighbor, we need to consider only $m > c \log_{3/2} n$. So we have

$$\Pr\left[S \text{ has expansion ratio less than } \frac{1}{c \log_{3/2} n}\right] \le \sum_{m = \lceil c \log_{3/2} n \rceil}^{n/2} 2 b^m < \frac{2 \, b^{c \log_{3/2} n}}{1 - b} = \frac{2}{1 - b} \, n^{5-c} < 31 \, n^{5-c},$$

where the last inequality follows from the assumption that $c > 6$. ∎

5 Repair mechanism 

Although a skip graph can survive a few disruptions, it is desirable to avoid accumulating errors. A large 
number of unrepaired failures will degrade the ideal search performance of a skip graph. Replication alone 
may not guarantee robustness, and we need a repair mechanism that automatically heals disruptions. Fur- 
ther, as failures occur continuously, the repair mechanism needs to continuously monitor the state of the 
skip graph to detect and repair these failures. The goal of the repair mechanism is to take a defective skip 
graph and repair all the defects. 

We describe the repair mechanism as follows: In Section 5.1, we show that the first two skip graph
constraints, given in Section 3.4, are always preserved in any execution of the skip graph. In Section 5.2,
we show how the remaining constraints can be checked locally by every node, and give algorithms to repair 
errors which may exist in the data structure. In Section 5.3, we prove that the repair mechanism given in 
Section 5.2 repairs a defective skip graph to give a defectless one. We note that the repair mechanism is 
not a self-stabilization mechanism in the strong sense because it will not repair an arbitrarily linked skip 
graph and restore it to its valid state. Instead, we see that certain defective configurations are impossible 
given the particular types of failures we consider. Thus the repair mechanism only repairs those failures that 
occur starting from a defectless skip graph. 



5.1 Maintaining the invariant 

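We again list the constraints that describe a skip graph, as given in Section 3.4. Let x be any node in the
skip graph; then for all levels $\ell \ge 0$:

1. If $xR_\ell \ne \bot$, $xR_\ell > x$.

2. If $xL_\ell \ne \bot$, $xL_\ell < x$.

3. If $xR_\ell \ne \bot$, $xR_\ell L_\ell = x$.

4. If $xL_\ell \ne \bot$, $xL_\ell R_\ell = x$.

5. If $m(x) \upharpoonright (\ell+1) = m(xR_\ell^k) \upharpoonright (\ell+1)$ and $\nexists j, j < k, m(x) \upharpoonright (\ell+1) = m(xR_\ell^j) \upharpoonright (\ell+1)$, then $xR_{\ell+1} = xR_\ell^k$.
Else, $xR_{\ell+1} = \bot$.

6. If $m(x) \upharpoonright (\ell+1) = m(xL_\ell^k) \upharpoonright (\ell+1)$ and $\nexists j, j < k, m(x) \upharpoonright (\ell+1) = m(xL_\ell^j) \upharpoonright (\ell+1)$, then $xL_{\ell+1} = xL_\ell^k$.
Else, $xL_{\ell+1} = \bot$.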






We define Constraints 1 and 2 as an invariant for a skip graph as they hold in all states even in the 
presence of failures. Constraints 3 to 6 may fail to hold with failures, but they can be restored by the 
repair mechanism. We call Constraints 3 and 4 the R and L backpointer constraints respectively, and 
Constraints 5 and 6 the R and L inter-level constraints respectively. Each node periodically checks to see 
if its backpointer or inter-level constraints have been violated. If it discovers an inconsistent constraint, it 
initiates the repair mechanism explained in Section 5.2. 

We consider the failure of some node x as an atomic action which eliminates x from the skip graph, and
effectively sets all pointers to it to $\bot$. If there are any pending messages to other nodes asking
them to change their pointers to point to x, then when those messages are delivered, the corresponding pointers
are set to $\bot$. In an actual implementation, each node y will periodically check whether its neighbors at each
level $\ell$ are alive, and in the absence of a response, it will set $yR_\ell = \bot$ or $yL_\ell = \bot$. For the purposes of our proofs,
however, we will consider that when a node fails, the pointers to it are atomically set to $\bot$. It is possible
that a node will detect its failed neighbors before it has to initiate some action, thus setting the pointers to
$\bot$ anyway. If it does not, we can ensure that a node checks that its neighbors at level $\ell$ are alive before it
processes a message that it receives for level $\ell$.

We use $xL'_\ell$ and $xR'_\ell$ to denote the values of node x's predecessor and successor, respectively, at level $\ell$ after some
operation has occurred.

We prove that the invariant is maintained for the insert and delete operations in the presence of node fail- 
ures. Then we give a repair mechanism that uses the invariant constraints to repair any violated backpointer 
or inter-level constraints due to node failures. 

Lemma 11 Failures preserve the invariant when no operation is in progress. 

Proof: Suppose that the invariant holds in the absence of any failures. If a node x has a failed successor
or predecessor at level $\ell$, we consider $xR_\ell = \bot$ or $xL_\ell = \bot$ respectively, which trivially satisfies Constraints 1
and 2. Thus, for all nodes y and all levels $\ell$, $yR'_\ell$ and $yL'_\ell$ are either equal to their previous values or they
are set to $\bot$, and the invariant is maintained. ∎

Lemma 12 The invariant is maintained during an insert operation even in the presence of failures. 

Proof: Suppose that the invariant holds prior to the insert operation. We consider each link change
during an insert operation and prove that it does not violate Constraint 1; we omit the details for
Constraint 2 as it is a mirror image of Constraint 1. We also consider only the case where the introducing node
s is less than the new node u that is being added, as the other case is similar. The successor link changes in
Algorithm 2 (and its subroutine Algorithm 4) during the insert operation are as follows:

• Line 23: Node u sends a getLinkOp message to s (line 21 of Algorithm 2). Node s either returns a
setLinkOp message to u (line 6 of Algorithm 4), or passes the message to $sL_0$ (line 4 of Algorithm 4) if
u < s. The latter case occurs when a new node has been inserted between s and $sL_0$ during concurrent
inserts. Thus u's original getLinkOp message is only passed to nodes smaller than u, and u receives the
corresponding setLinkOp from a node p smaller than itself. Thus $pR'_0 = u > p$ (line 7 of Algorithm 4).
If some node fails in this process or no suitable predecessor exists, all R links remain unchanged, thus
satisfying Constraint 1.

• Line 26: Node u sends a getLinkOp message to $sR_0 > u$ (line 21 of Algorithm 2). Node $sR_0$ either
returns a setLinkOp message to u (line 6 of Algorithm 4), or passes the message on to its own successor (line 4 of
Algorithm 4) if $sR_0 < u$. The latter case occurs when a new node has been inserted between s and $sR_0$
during concurrent inserts. Thus u's original getLinkOp message is only passed to nodes greater than
itself, and it receives the corresponding setLinkOp from a node v greater than itself. Thus $uR'_0 = v > u$.
If some node fails in this process or no suitable successor exists, $uR'_0 = \bot$, thus satisfying Constraint 1.

• Line 34: For level $\ell \ge 0$, u sends a buddyOp message to $uR_\ell > u$ (line 32 of Algorithm 2). If $m(uR_\ell)_\ell = m(u)_\ell$,
then $uR_\ell$ sends a setLinkOp message to u (line 6 of Algorithm 4), and $uR'_{\ell+1} = uR_\ell > u$. Else,
$uR_\ell$ sends the buddyOp message to $uR_\ell^2$ (line 4 of Algorithm 4). Node u's original buddyOp message is
only sent to nodes greater than itself, and it receives the corresponding setLinkOp message only from
a node z greater than itself. Thus $uR'_{\ell+1} = z > u$. If some node fails in this process or no suitable
successor exists, $uR'_{\ell+1} = \bot$, thus satisfying Constraint 1.

• Line 39: For level $\ell \ge 0$, u sends a buddyOp message to $uL_\ell < u$ (line 32 of Algorithm 2). If $m(uL_\ell)_\ell = m(u)_\ell$,
then $uL_\ell$ sends a setLinkOp message to u (line 6 of Algorithm 4), and $uL_\ell R'_{\ell+1} = u > uL_\ell$.
Else, $uL_\ell$ sends the buddyOp message to $uL_\ell^2$ (line 4 of Algorithm 4). Node u's original buddyOp
message is only sent to nodes smaller than itself, and it receives the corresponding setLinkOp message
only from a node s smaller than itself. Thus $sR'_{\ell+1} = u > s$ (line 7 of Algorithm 4). If some node
fails in this process or no suitable predecessor exists, all R links remain unchanged, thus satisfying
Constraint 1.

Thus the invariant is maintained during an insert operation even in the presence of failures. ∎

Lemma 13 The invariant is maintained during a delete operation even in the presence of failures. 

Proof: Suppose that the invariant holds prior to the delete operation. We consider each link change
during a delete operation and prove that it does not violate Constraint 1; we omit the details for
Constraint 2 as it is a mirror image of Constraint 1. The successor link changes in Algorithm 7 during the
delete operation of node u are as follows:

• Line 20: At each level $\ell \ge 0$, u sends a deleteOp message to $uR_\ell$ (line 4 of Algorithm 6). If $uR_\ell$ is being
deleted, it passes the message to $uR_\ell^2$ (line 4 of Algorithm 7). This message is only passed to nodes
greater than u until it reaches a node v > u which is not being deleted. Node v sends a findNeighborOp
message to $vL_\ell$ (line 8 of Algorithm 7), which is passed to the left (line 15 of Algorithm 7) until it
reaches a node s < u which is not being deleted. Node s sends a foundNeighborOp to v (line 19 of
Algorithm 7), and it sets $sR'_\ell = v > s$. If no suitable s is found (line 17 of Algorithm 7), all the R
links remain unchanged, thus satisfying Constraint 1.

• Line 29: If no suitable v, which is not being deleted, is found (line 6 of Algorithm 7), u sends a
setNeighborNilOp to $uL_\ell < u$ (line 8 of Algorithm 7). This message is passed to the left (line 24 of
Algorithm 7) until it reaches some node s < u which is not being deleted. Node s sets $sR'_\ell = \bot$ (line 29
of Algorithm 7). If no suitable s is found (line 17 of Algorithm 7), all the R links remain unchanged,
thus satisfying Constraint 1.

Thus the invariant is maintained during a delete operation even in the presence of failures. ∎

Combining Lemmas 11, 12 and 13 directly gives Theorem 14.

Theorem 14 The invariant is maintained throughout any execution of a skip graph, even with failures.

5.2 Restoring invalid constraints

The backpointer and inter-level constraints are violated during insert and delete operations as well as when a
node fails. However, we will see that the repair mechanism needs to be triggered only for constraint violations
caused by failures, and not during the insert and delete operations. We give a repair mechanism in which
each node periodically checks Constraints 3 to 6 and initiates actions to fix constraints invalidated by node
failures.

Although Constraints 3 to 6 may be violated midway through an insert or a delete operation, once all the
pending operations are completed, these constraints are satisfied. Thus the repair mechanism
is required to restore these constraints only in case of node failures. When a node fails during an insert or
a delete, it leads to violations of the backpointer and inter-level constraints of its neighbors. Each node also
periodically checks its backpointer and inter-level constraints. In Algorithm 8, node x checks its backpointer
constraints by sending checkNeighborOp messages to its neighbors at all levels (lines 3 and 5). Similarly,
each node checks its inter-level constraints as explained in Section 5.2.2.

As shown in Figure 7, when node y fails during an insert at level 2 (after having successfully inserted
itself at levels 0 and 1) or during a delete at level 1 (after having successfully deleted itself from level 2),
its neighbors x and z detect this failure. With the failure of y, x and z will detect inconsistencies in their
constraints and initiate the mechanism to repair them. We prove that it is sufficient to detect and repair
the violated constraints to restore the skip graph to its defectless state.



[Figure 7 shows levels 0, 1 and 2 of a skip graph; the legend marks the failed node y and the nodes that detect a failed constraint.]

Figure 7: Violation of backpointer and inter-level constraints when a node fails half-way through an insert
or delete operation. Observe that $xR_2 \ne xR_1^j$ for any $j \ge 1$, and $zL_2 \ne zL_1^k$ for any $k \ge 1$.



The repair mechanism is divided into two parts: the first part is used to repair the invalid backpointer
constraints, and the second part is used to repair invalid inter-level constraints.

5.2.1 Restoring backpointer constraints

Each node x periodically checks that $xR_\ell L_\ell = x$ when $xR_\ell \ne \bot$, and that $xL_\ell R_\ell = x$ when $xL_\ell \ne \bot$, for
all levels $0 \le \ell \le$ x.maxLevel. It triggers the backpointer constraint repair mechanism (Algorithm 8) if it
detects an inconsistency.

Lemma 15 In the absence of new failures, inserts and deletes, the repair mechanism described in Algo- 
rithm 8 repairs any violated backpointer constraint without losing existing connectivity. 

Proof: We prove that Algorithm 8 repairs the violated backpointer constraints for a single node without
losing existing connectivity. We concentrate on the repair of the R links as the case for L links is symmetric.
The violations of Constraint 3 for node v at level $\ell$ are as follows:

1. $vR_\ell = z > v$ but $zL_\ell = \bot$: Node v sends (checkNeighborOp, L, v, $\ell$) to z (line 5 of Algorithm 8).
As $zL_\ell = \bot$ and z > v, z sets $zL'_\ell = v$ (line 15 of Algorithm 8). Thus after Algorithm 8 finishes,
$vR'_\ell L'_\ell = v$, restoring Constraint 3.

2. $vR_\ell = z > v$ but $zL_\ell = y > v$: Node v sends (checkNeighborOp, L, v, $\ell$) to z (line 5 of Algorithm 8).
As $zL_\ell = y > v$, z passes the message on to y (line 18 of Algorithm 8). As long as this message reaches
some node y > v such that $yL_\ell > v$, y will pass it on to $yL_\ell$, until it reaches a node x > v such that
$xL_\ell < v < x$ or $xL_\ell = \bot$. Then x sets $xL'_\ell = v < x$ (line 15 or 23 of Algorithm 8). Node x also sends
(checkNeighborOp, R, x, $\ell$) to v (line 16 or 21 of Algorithm 8). Upon receiving that message, v sets
$vR'_\ell = x$. Thus after Algorithm 8 finishes, $vR'_\ell L'_\ell = v$, restoring Constraint 3.

3. $vR_\ell = z > v$ but $zL_\ell = u < v$: Node v sends (checkNeighborOp, L, v, $\ell$) to z (line 5 of Algorithm 8).
As $zL_\ell = u < v$, z sets $zL'_\ell = v < z$ (line 20 of Algorithm 8). Thus after Algorithm 8 finishes,
$vR'_\ell L'_\ell = v$, restoring Constraint 3.
Algorithm 8: Algorithm for repairing invalid backpointer constraints for node v.

 1 for i ← v.max_levels downto 0 do
 2     if (v.neighbor[L][i] ≠ ⊥) then
 3         send (checkNeighborOp, R, v, i) to v.neighbor[L][i]
 4     if (v.neighbor[R][i] ≠ ⊥) then
 5         send (checkNeighborOp, L, v, i) to v.neighbor[R][i]

 6 upon receiving (checkNeighborOp, side, newNeighbor, ℓ):
 7 if (side = R) then
 8     cmp ← <
 9     otherSide ← L
10 else
11     cmp ← >
12     otherSide ← R
13 if (v.neighbor[side][ℓ] ≠ newNeighbor) then
14     if ((v.neighbor[side][ℓ] = ⊥) and (v cmp newNeighbor)) then
15         v.neighbor[side][ℓ] ← newNeighbor
16         send (checkNeighborOp, otherSide, v, ℓ) to newNeighbor
17     else if ((v.neighbor[side][ℓ] ≠ ⊥) and (v.neighbor[side][ℓ] cmp newNeighbor)) then
18         send (checkNeighborOp, side, newNeighbor, ℓ) to v.neighbor[side][ℓ]
19     else
20         send (checkNeighborOp, otherSide, newNeighbor, ℓ) to v.neighbor[side][ℓ]
21         send (checkNeighborOp, otherSide, v, ℓ) to newNeighbor
22         send (checkNeighborOp, side, v.neighbor[side][ℓ], ℓ) to newNeighbor
23         v.neighbor[side][ℓ] ← newNeighbor
We prove that the backpointer constraint repair mechanism does not lose any existing connectivity in
the skip graph, i.e., if a path between nodes v and z used to exist before the repair mechanism was initiated,
a path will still exist after the repair mechanism operations finish. We consider the case where a node v
changes its successor pointer from a node z to another node y; we omit the case where v changes its
predecessor pointer as it is similar to this case. Node v updates its successor pointer at level
$\ell$ from some node z to some other node y only when $vR_\ell = z > y > v$ (line 23 of Algorithm 8). Node v also sends
messages to (i) z to update its predecessor pointer to point to y (line 20 of Algorithm 8), (ii) y to update
its predecessor pointer to point to v (line 21 of Algorithm 8) and (iii) y to update its successor pointer to
point to z (line 22 of Algorithm 8). When these messages are delivered, $vR'_\ell = y$ and $yR'_\ell = z$. Thus the
path v-z is now replaced by a longer path v-y-z, and no existing connectivity is lost. ∎
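The message handler of Algorithm 8 can be exercised on a single level with a simple FIFO message queue. The Python sketch below is an illustration under assumptions, not the paper's implementation: nodes are identified by integer keys, `make_level`, `handle` and `repair` are hypothetical names, message delivery is modeled as an in-order queue, and the ordering guard at the top of `handle` is a defensive addition that is not part of Algorithm 8. The two test inputs correspond to violations 3 and 2 of Constraint 3 listed in the proof of Lemma 15.

```python
from collections import deque

L, R = "L", "R"

def make_level(links):
    """links: {key: (left_key_or_None, right_key_or_None)} for one level."""
    return {k: {L: l, R: r} for k, (l, r) in links.items()}

def handle(nbr, queue, v, side, new):
    """Node v receives (checkNeighborOp, side, new); lines 6-23 of Algorithm 8."""
    other = L if side == R else R
    cmp = (lambda a, b: a < b) if side == R else (lambda a, b: a > b)
    if not cmp(v, new):                 # defensive guard (assumption, not in Algorithm 8)
        return
    cur = nbr[v][side]
    if cur == new:                      # line 13: nothing to repair
        return
    if cur is None:                     # lines 14-16: adopt the new neighbor
        nbr[v][side] = new
        queue.append((new, other, v))
    elif cmp(cur, new):                 # lines 17-18: pass the message along
        queue.append((cur, side, new))
    else:                               # lines 19-23: splice `new` between v and cur
        queue.append((cur, other, new))
        queue.append((new, other, v))
        queue.append((new, side, cur))
        nbr[v][side] = new

def repair(nbr, max_steps=10_000):
    """Every node checks both neighbors (lines 1-5), then messages are delivered."""
    queue = deque()
    for v in nbr:
        if nbr[v][L] is not None:
            queue.append((nbr[v][L], R, v))
        if nbr[v][R] is not None:
            queue.append((nbr[v][R], L, v))
    steps = 0
    while queue and steps < max_steps:
        dest, side, new = queue.popleft()
        handle(nbr, queue, dest, side, new)
        steps += 1
    return steps

# Violation 3: 3.R = 5 but 5.L = 1 < 3.
case3 = make_level({1: (None, 3), 3: (1, 5), 5: (1, 7), 7: (5, None)})
repair(case3)
# Violation 2: 3.R = 7 but 7.L = 5 > 3, and 5 has no predecessor pointer.
case2 = make_level({1: (None, 3), 3: (1, 7), 5: (None, 7), 7: (5, None)})
repair(case2)
```

Both runs end with the sorted, doubly-linked list 1, 3, 5, 7; in the second run node 3's successor pointer moves from 7 to 5, replacing the path 3-7 by the longer path 3-5-7 as in the connectivity argument above.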

It is possible that a node x detects a failed backpointer constraint if it checks it while some node y is
in the middle of its insert operation. Suppose that $xR_\ell = z$ but $zL_\ell = y$ because y is yet to connect to x.
When x sends a checkNeighborOp message to z, it gets passed to y, which then links to x and asks x to link
to it (both through the repair mechanism and the insert operation). Thus, the repair mechanism generates
additional messages but does not affect the insert operation. In case of a delete operation, a node does not
delete itself until it has repaired the links at all the levels, so an inconsistent backpointer constraint will not
be detected during the delete operation.

5.2.2 Restoring inter-level constraints 

We see how each node periodically checks Constraint 5; we omit the details for Constraint 6 as it is a mirror
image of Constraint 5. For each level $\ell > 0$, each node x sends a message to $xR_{\ell-1}$ to check if $xR_\ell = xR_{\ell-1}^k$
for some $k > 0$. Each node that receives the message passes it to the right until one of the following four
cases occurs:

1. The message reaches node a, $a < xR_\ell$ and $m(a) \upharpoonright \ell = m(x) \upharpoonright \ell$.

2. The message reaches node a, $a < xR_\ell$ and $aR_{\ell-1} = \bot$.

3. The message reaches node a, $a = xR_\ell$.

4. The message reaches node a, $a > xR_\ell$.

In case 3, Constraint 5 is not violated and no repair action is needed. We provide a repair mechanism
for each of the remaining three cases. The repair mechanism for fixing violations of Constraint 6 is
symmetric. It may be possible to combine the two mechanisms to improve the performance, but we will treat
them separately for simplicity.
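The four outcomes of the inter-level check can be sketched as a walk along level $\ell - 1$. The Python fragment below is a minimal illustration under assumptions: nodes are integer keys, `right[level]` is a hypothetical dictionary of successor pointers (a missing entry stands for $\bot$), `m` maps keys to finite membership-vector strings, and the function name `classify` is invented for this example. It also assumes $xR_\ell \ne \bot$ and treats x itself as the node a of case 2 when x has no successor at level $\ell - 1$.

```python
def classify(x, level, right, m):
    """Walk right along level-1 from x; return (case number, node a where the walk stops)."""
    target = right[level][x]              # xR_l, assumed to exist
    a = x
    while True:
        nxt = right[level - 1].get(a)
        if nxt is None:                   # case 2: the level-1 list ends before xR_l
            return 2, a
        a = nxt
        if a == target:                   # case 3: constraint holds, no repair needed
            return 3, a
        if a > target:                    # case 4: the walk overshoots xR_l
            return 4, a
        if m[a][:level] == m[x][:level]:  # case 1: a closer node matches x's prefix
            return 1, a

m = {1: "00", 2: "10", 3: "01", 4: "11"}
ok      = {0: {1: 2, 2: 3, 3: 4}, 1: {1: 3, 2: 4}}   # defectless
broken  = {0: {3: 4},             1: {1: 3}}          # node 2 failed, 1 lost its level-0 link
skipped = {0: {1: 4},             1: {1: 3}}          # node 3 missing from 1's level-0 list
m_stale = {1: "00", 2: "00", 3: "01", 4: "11"}        # node 2 shares 1's prefix
stale   = {0: {1: 2, 2: 3, 3: 4}, 1: {1: 3}}          # but 1 still points past it at level 1
```

Calling `classify(1, 1, ...)` on the four inputs exercises cases 3, 2, 4 and 1 respectively; only the first leaves nothing to repair.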

In each case, we assume that the link is present at level $\ell$ but absent at level $\ell - 1$. Note that if a node x
is linked at level $\ell - 1$ but not at level $\ell$, it can easily traverse the list at level $\ell - 1$ to determine which node
to link to at level $\ell$. This process is identical to the insertion process where a new node uses its neighbors at
lower levels to insert itself at higher levels in the skip graph.

The violations of Constraint 5 are as follows: 

1. $xR_\ell = xR_{\ell-1}^k$, but $\exists a = xR_{\ell-1}^j$ with $j < k$ and $m(a) \upharpoonright \ell = m(x) \upharpoonright \ell$.

Figure 8: Two-way merge to repair a violated inter-level constraint.

The nodes connected to a and x at level $\ell$ have to be merged together into one list by sending the
following messages:

• Probe level $\ell$ (in list containing a) to find largest $aR_\ell^k = R < xR_\ell$. Node a starts the probe by
sending a message to $aR_\ell$. Upon reaching node y, if $y < xR_\ell < yR_\ell$, the probe ends with y = R.
Otherwise the message is passed to $yR_\ell$.

• Send (zipperOpF, $xR_\ell$, $\ell$) to R.

• Probe level $\ell$ (in list containing a) to find smallest $aL_\ell^k = L > x$. Node a starts the probe by
sending a message to $aL_\ell$. Upon reaching node y, if $yL_\ell < x < y$, the probe ends with y = L.
Otherwise the message is passed to $yL_\ell$.

• Send (zipperOpB, x, $\ell$) to L.

2. $xR_\ell \ne xR_{\ell-1}^k$ for any $k > 0$, and $\exists a < xR_\ell$, $aR_{\ell-1} = \bot$.

[Figure 9 shows levels $\ell$ and $\ell - 1$, the node $xR_\ell$, and the zipperOpB message.]

Figure 9: One-way merge to repair a violated inter-level constraint.

The nodes connected to a and $xR_\ell$ at level $\ell - 1$ have to be merged together into one list by sending
the following messages:

• Probe level $\ell - 1$ (in list containing $xR_\ell$) to find smallest $xR_\ell L_{\ell-1}^k = L > a$. Node $xR_\ell$ starts the
probe by sending a message to $xR_\ell L_{\ell-1}$. Upon reaching node y such that $y > a > yL_{\ell-1}$, the
probe stops with y = L. Otherwise the message is passed on to $yL_{\ell-1}$.

• Send (zipperOpB, a, $\ell - 1$) to L.

3. $xR_\ell \ne xR_{\ell-1}^j$ for any $j > 0$, and $\exists a = xR_{\ell-1}^k > xR_\ell$.

Figure 10: Two-way merge to repair a violated inter-level constraint.

The nodes connected to a and $xR_\ell$ at level $\ell - 1$ have to be merged together into one list by sending
the following messages (see footnote 2):

• Probe level $\ell - 1$ (in list containing $xR_\ell$) to find largest $xR_\ell R_{\ell-1}^k = R < a$. Node $xR_\ell$ starts the
probe by sending a message to $xR_\ell R_{\ell-1}$. Upon reaching node y, if $y < a < yR_{\ell-1}$, the probe
ends with y = R. Otherwise the message is passed to $yR_{\ell-1}$.

• Send (zipperOpF, a, $\ell - 1$) to R.

• Probe level $\ell - 1$ (in list containing $xR_\ell$) to find smallest $xR_\ell L_{\ell-1}^k = L > aL_{\ell-1}$. In this case,
the probe proceeds along the predecessors of $xR_\ell$ at level $\ell - 1$ until it reaches node y such that
$y = L > aL_{\ell-1} > yL_{\ell-1}$.

• Send (zipperOpB, $aL_{\ell-1}$, $\ell - 1$) to L.



Algorithm 9: zipperOpB for node v

1 upon receiving (zipperOpB, x, ℓ):
2 if v.neighbor[L][ℓ] > x then
3     send (zipperOpB, x, ℓ) to v.neighbor[L][ℓ]
4 else
5     tmp = v.neighbor[L][ℓ]
6     v.neighbor[L][ℓ] = x
7     send (updateOp, R, v, ℓ) to x
8     if tmp ≠ ⊥ then
9         send (zipperOpB, tmp, ℓ) to x

Algorithm 10: zipperOpF for node v

1 upon receiving (zipperOpF, x, ℓ):
2 if v.neighbor[R][ℓ] < x then
3     send (zipperOpF, x, ℓ) to v.neighbor[R][ℓ]
4 else
5     tmp = v.neighbor[R][ℓ]
6     v.neighbor[R][ℓ] = x
7     send (updateOp, L, v, ℓ) to x
8     if tmp ≠ ⊥ then
9         send (zipperOpF, tmp, ℓ) to x



Footnote 2: Details of the zipperOp messages are given in Algorithms 9 and 10.

Figure 11: zipperOp operation to merge nodes on the same level.



Lemma 16 In the absence of new failures, inserts and deletes, the repair mechanism described in Sec- 
tion 5.2.2 repairs any violated inter-level constraint without losing existing connectivity. 

Proof: The algorithm initiates repair for all the possible violations of the inter-level constraints of a
node as given above. It only remains to be proved that the zipperOp messages merge two sorted lists at a
given level into a single sorted list, without losing existing connectivity. We concentrate on the zipperOpF
messages as the zipperOpB messages are symmetric.

To prove that the repair mechanism merges two sorted lists into a single sorted one, we first see that a
node v always receives a zipperOpF message to link to a node greater than itself. The initial zipperOpF
messages sent are as follows: (i) in Case 3, R receives a zipperOpF message to link to a > R, and (ii) in
Case 1, R receives a zipperOpF message to link to $xR_\ell > R$. When a node v receives a zipperOpF message
to link to x, if $vR_\ell < x$, it sends the message to $vR_\ell$ to link to x (line 3 of Algorithm 10). Otherwise, it
updates $vR'_\ell = x$ (line 6 of Algorithm 10), and it sends a zipperOpF message to $x < vR_\ell$ to link to $vR_\ell$
(line 9 of Algorithm 10). In both cases, the zipperOpF message reaches a node that has to link to a node
greater than itself. Also, each node v only links to a new node x if it is smaller than the current successor
of v, so $v < vR'_\ell = x < vR_\ell$. Thus the two sorted lists get merged into a single sorted list, until one of the
lists terminates.

We also prove that the inter-level constraint repair mechanism does not lose any existing connectivity in
the skip graph, i.e., if a path between nodes v and z used to exist before the repair mechanism was initiated,
a path will still exist after the repair mechanism operations finish. A link change occurs only when v receives
a zipperOpF message to link to $x < vR_\ell = z$. Node v sends a zipperOpF message to x to link to z. Upon
receipt of that message, x either sets $xR'_\ell = z$, or passes the message to $xR_\ell < z$. In the first
case, the path between v and z is replaced by a new longer path v-x-z. In the latter case, the zipperOp
message passes through several nodes $xR_\ell, xR_\ell^2, \ldots$, until it reaches a node y such that $y < z < yR_\ell$ and
y sets $yR'_\ell = z$. Then the path v-z is replaced by a longer path $v$-$x$-$xR_\ell$-$xR_\ell^2$-$\cdots$-$y$-$z$, and no existing
connectivity is lost. ∎
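The merging behavior of zipperOpF can be traced on two small sorted lists. The Python sketch below is an illustration under assumptions rather than the paper's code: the chain of zipperOpF messages is followed by a loop instead of real message passing, the updateOp message is applied immediately, a missing successor ($\bot$) is treated as larger than every key, and the names `zipper_op_f`, `chain` and `walk` are hypothetical.

```python
def zipper_op_f(right, left, v, x):
    """Deliver (zipperOpF, x) to node v and follow the resulting message chain."""
    while x is not None:
        succ = right.get(v)
        if succ is not None and succ < x:   # lines 2-3: forward to v's successor
            v = succ
        else:                               # lines 5-9 of Algorithm 10
            right[v], left[x] = x, v        # v links to x; updateOp sets x's predecessor
            v, x = x, succ                  # x must now link to v's old successor

def chain(keys):
    """Pointer dictionaries for one sorted, doubly-linked list."""
    right = {a: b for a, b in zip(keys, keys[1:])}
    left = {b: a for a, b in zip(keys, keys[1:])}
    return right, left

def walk(right, start):
    out = [start]
    while out[-1] in right:
        out.append(right[out[-1]])
    return out

ra, la = chain([1, 4, 9])
rb, lb = chain([2, 3, 7, 10])
right, left = {**ra, **rb}, {**la, **lb}
zipper_op_f(right, left, 1, 2)              # node 1 is asked to link to node 2
merged = walk(right, 1)
```

Here `merged` is the single sorted list 1, 2, 3, 4, 7, 9, 10, and every link replaced along the way (for example 1-4) is covered by a longer path through the newly interleaved nodes, mirroring the connectivity argument in the proof.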

It is possible that a node x detects a failed inter-level constraint if it checks it while some node y is in
the middle of its insert or delete operation. Node x will detect a failed constraint at level $\ell$ if y has inserted
itself at level $\ell - 1$ and not at level $\ell$. The probe message along level $\ell - 1$ will reach y, which can then inform
x that it is yet to complete its insert operation, and thus terminate the repair mechanism.

5.3 Proof of correctness 

In this section we prove that the repair mechanism given in Algorithm 8 and Section 5.2.2 repairs
a defective skip graph.

We prove that the repair mechanism repairs a defective skip graph by showing that it repairs level 0 after
some finite interval of time, and then uses the links at level 0 to restore the links at higher levels. Lemma 17
and Corollary 18 show that if there exists a path between two nodes that consists entirely of pointers in any
one direction (L or R), then the repair mechanism ensures that after some finite interval of time, there is a
path between those two nodes in the same direction at level 0. For their proofs, we consider only the case
for the R links as the case for the L links is symmetric. This result is further extended in Lemma 19, which
shows that as long as there is a path between two nodes, irrespective of the directions of the edges in the
path, there will be a path in both directions between the two nodes at level 0. Corollary 20 shows that this
leads to a single, sorted, doubly-linked list at level 0, as in a defectless skip graph. Finally, Lemma 21 and
Theorem 22 show how the list at level 0 is used to create lists at higher levels, to eventually give a defectless
skip graph.

Lemma 17

1. Suppose we have $y_0 R_{\ell_1} R_{\ell_2} \cdots R_{\ell_r} = y_r$, $y_0 < y_r$, for some r, and for each $1 \le i \le r$, $\ell \le \ell_i \le \ell + 1$. Then
after some finite interval of time, $y_0 R_\ell^k = y_r$ for some k.

2. Suppose we have $y_0 L_{\ell_1} L_{\ell_2} \cdots L_{\ell_r} = y_r$, $y_0 > y_r$, for some r, and for each $1 \le i \le r$, $\ell \le \ell_i \le \ell + 1$. Then
after some finite interval of time, $y_0 L_\ell^k = y_r$ for some k.

Proof: Let $y_0 R_{\ell_1} R_{\ell_2} \cdots R_{\ell_i} = y_i$. Then there exists a link between each $y_{i-1}$ and $y_i$ at level $\ell$ or
$\ell + 1$. For each $R_{\ell_i} = R_{\ell+1}$, as $y_{i-1} R_{\ell+1} = y_i$, the inter-level repair mechanism given in Section 5.2.2
ensures that after some finite interval of time, $y_{i-1} R_\ell^{k_i} = y_i$ for some $k_i$ (Lemma 16). We then have
$y_0 R_{\ell_1} R_{\ell_2} \cdots R_{\ell_r} = y_0 R_\ell^{k_1} R_\ell^{k_2} \cdots R_\ell^{k_r} = y_r$, where $k_i = 1$ if $R_{\ell_i} = R_\ell$. Thus we get $y_0 R_\ell^k = y_r$, where
$k = k_1 + k_2 + \cdots + k_r$. ∎

Corollary 18

1. Suppose we have $y_0 R_{\ell_1} R_{\ell_2} \cdots R_{\ell_r} = y_r$, $y_0 < y_r$, for some r, and for each i, $1 \le i \le r$, $\ell_i \ge 0$. Then
after some finite interval of time, $y_0 R_0^k = y_r$ for some k.

2. Suppose we have $y_0 L_{\ell_1} L_{\ell_2} \cdots L_{\ell_r} = y_r$, $y_0 > y_r$, for some r, and for each i, $1 \le i \le r$, $\ell_i \ge 0$. Then after
some finite interval of time, $y_0 L_0^k = y_r$ for some k.

Proof: Let $y_0 R_{\ell_1} R_{\ell_2} \cdots R_{\ell_i} = y_i$. Then for each $y_{i-1}$, $y_{i-1} R_{\ell_i} = y_i$. By Lemma 17, after some finite
interval of time, $y_{i-1} R_{\ell_i - 1}^{k_{i,\ell_i - 1}} = y_i$ for some $k_{i,\ell_i - 1}$. By repeatedly applying Lemma 17, after some finite
interval of time, we get $y_{i-1} R_0^{k_{i,0}} = y_i$ for some $k_{i,0}$. So we get $y_0 R_0^{k_{1,0}} R_0^{k_{2,0}} \cdots R_0^{k_{r,0}} = y_r$. Thus, $y_0 R_0^k = y_r$,
where $k = k_{1,0} + k_{2,0} + \cdots + k_{r,0}$. ∎

Lemma 19 Suppose we have $y_0 E_{\ell_1} E_{\ell_2} \cdots E_{\ell_r} = y_r$, for some r, $y_0 < y_r$, $E \in \{L, R\}$, and for each $1 \le i \le r$,
$\ell_i \ge 0$. Then after some finite interval of time, $y_0 R_0^k = y_r$ and $y_r L_0^k = y_0$ for some k.

Proof: Let $y_0 E_{\ell_1} E_{\ell_2} \cdots E_{\ell_i} = y_i$. For each $y_i$, $y_{i-1} E_{\ell_i} = y_i$. By Corollary 18, after some finite interval
of time, $y_{i-1} E_0^{k_i} = y_i$ for some $k_i$. So we have $y_0 E_0^{k_1} E_0^{k_2} \cdots E_0^{k_r} = y_r$. If any of the $y_i$'s are not distinct, then
we can eliminate the path between two consecutive occurrences of $y_i$. So we can replace a path of the form
$y_i E_0^{k_{i+1}} \cdots y_j E_0^{k_{j+1}}$, where $y_i = y_j$, with $y_i E_0^{k_{j+1}}$. Thus we have a path consisting of $L_0$ and $R_0$ edges starting at
$y_0$ and terminating at $y_r$ which consists only of unique nodes.

For each node, after some finite interval of time, the backpointer constraint repair mechanism given in
Algorithm 8 will repair any violated backpointer constraints without losing existing connectivity (Lemma 15).
So Constraints 3 and 4 are repaired for all the nodes in the path, and as proved in Theorem 14, Constraints 1
and 2 are always maintained. As proved in Theorem 5, with Constraints 1 through 4 satisfied for all the
nodes in the path, we get a sorted, doubly-linked list of the nodes. Thus, $y_0 R_0^k = y_r$ and $y_r L_0^k = y_0$, for
some k. ∎

Corollary 20 After some finite interval of time, all nodes in the same connected component of a skip graph 
are linked together in a single, sorted, doubly-linked list at level 0. 

Proof: Lemma 19 shows that any two connected nodes x and y are in the same sorted, doubly-linked
list at level 0 after some finite interval of time. Any other node z in the same connected component is also
connected to both x and y, so it has to be in the same list at level 0. Thus, all the nodes in the same
connected component are in a single list at level 0. ∎

Given a set S, let $M_\ell(S)$ be the set of all membership vector prefixes of length $\ell$ represented by the nodes
in S, i.e., $M_\ell(S) = \{w \mid \exists x \in S, m(x) \upharpoonright \ell = w\}$.






Lemma 21 Suppose we have all nodes in the same connected component C of a skip graph linked together
in $|M_\ell(C)|$ sorted, doubly-linked lists at level $\ell$, one for each $w \in M_\ell(C)$. Then after some finite interval of
time, we get $|M_{\ell+1}(C)|$ sorted, doubly-linked lists at level $\ell + 1$, one for each $w \in M_{\ell+1}(C)$.

Proof: Consider a single list L at level $\ell$, and let node $x \in L$. As explained in the successor constraint
repair mechanism given in Section 5.2.2, if $\{xL_\ell, xR_\ell\} \ne \{\bot\}$, x uses its neighbors at level $\ell$ to find its neighbors
at level $\ell + 1$. As all the lists at level $\ell$ are sorted and doubly-linked, x can find $xR_{\ell+1}$ and $xL_{\ell+1}$ in $O(2|S|)$
time, just like in an insert operation (Algorithm 2). When all the nodes of list L determine their neighbors
at level $\ell + 1$ after some finite interval of time, we get the lists from the nodes of L at level $\ell + 1$, one for
each $w \in M_{\ell+1}(L)$. This is identical to the insert operation and as proved in Lemma 6, all the lists thus
created are sorted and doubly-linked, and only consist of nodes that have the matching membership vector
prefix of length $\ell + 1$. Thus, considering all the $|M_\ell(C)|$ lists at level $\ell$, after some finite interval of time, we
get $|M_{\ell+1}(C)|$ sorted, doubly-linked lists at level $\ell + 1$, one for each $w \in M_{\ell+1}(C)$. ∎

Theorem 22 The repair mechanism given in Section 5 repairs a defective skip graph to give a defectless 
skip graph after some finite interval of time. 

Proof: Corollary 20 shows that after some finite interval of time, the repair mechanism links all the 
nodes in the same connected component in a single, sorted, doubly-linked list at level 0. Further, there are 
only finitely many levels in a skip graph. Using Lemma 21 inductively, with Corollary 20 as the base case, we 
can show that we get sorted, doubly-linked lists, which contain all the nodes with the matching non-empty 
membership vector prefixes, at all levels of the skip graph. Thus, the repair mechanism repairs a defective 
skip graph to give a defectless skip graph after some finite interval of time. ∎

6 Congestion 

In addition to fault tolerance, a skip graph provides a limited form of congestion control, by smoothing out 
hot spots caused by popular search targets. The guarantees that a skip graph makes in this case are similar 
to the guarantees made for survivability. Just as a node's continued connectivity depends on the survival 
of its neighbors, its message load depends on the popularity of its neighbors as search targets. However, we 
can show that this effect drops off rapidly with distance; nodes that are far away from a popular target in 
the bottom-level list of a skip graph get little increased message load on average. 

We give two versions of this result. The first version, given in Section 6.1, shows that the probability 
that a particular search uses a node between the source and target drops off inversely with the distance from 
the node to the target. This fact is not necessarily reassuring to heavily-loaded nodes. Since the probability 
averages over all choices of membership vectors, it may be that some particularly unlucky node finds itself 
with a membership vector that puts it on nearly every search path to some very popular target. The second 
version, given in Section 6.2, shows that our average-case bounds hold with high probability. While it is 
still possible that a spectacularly unlucky node is hit by most searches, such a situation only occurs for very 
low-probability choices of membership vectors. It follows that most skip graphs alleviate congestion well. 
For our results, we consider skip graphs with alphabet {0, 1}. 

6.1 Average congestion for a single search 

Our argument that the average congestion is inversely proportional to distance is based on the observation 
that a node only appears on a search path in a skip list S if it is among the tallest nodes between itself and 
the target. We will need a small technical lemma that counts the expected number of such tallest nodes. 
Consider a set-valued Markov process $A_0 \supseteq A_1 \supseteq A_2 \cdots$ where $A_0$ is some nonempty initial set and each
element of $A_t$ appears in $A_{t+1}$ with independent probability $\frac{1}{2}$. Let $\tau$ be the largest index for which $A_\tau$ is
not empty. We will now show that $E[|A_\tau|]$ is small regardless of the size of the initial set $A_0$.

Lemma 23 Let $A_0, A_1, \ldots, A_\tau$ be defined as above. Then $E[|A_\tau|] < 2$.

Proof: The bound on E[|^4 T |] will follow from a surprising connection between E[|^4 r |] and Pr[|^4 T | = 1]. 
We begin by obtaining a recurrence for Pr[|/1 T |] = 1. 

Let P{n) — Pr[|^4 r | — 1 : \A$\ = n]. Clearly P(0) = and P(l) = 1 > §. For larger n, summing over 
all k = |Ai| gives P(n) = 1~ n YZ=o (l) p ( k )> which can bc rewritten as P(n) = Y2=l (l) p ( k )- The 
solution to this recurrence goes asymptotically to ± 10~ 5 « 0.7213 • • • [SF96, Theorem 7.9]; however, 
we will use the much simpler property that when n > 2, P(n) ~ Pr[|^4 r | = 1] is the probability of an event 
that does not always occur, and is thus less than 1. 

Let E(n) = E[|A r | : \Aq\ = n}. Then E{n) = 2"" (n + YJl=i (t) E ( k ))> and eliminating the E(n) term 
on the right-hand side gives E(n) = (n + YJkZi (fc)^( fc )) ■ 

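For $n = 1$, $E(1) = 1$ by definition. Recall that $P(1)$ is also equal to 1. We will now show by induction that
$E(n) = 2P(n)$ for all $n > 1$. Suppose that $E(k) = 2P(k)$ for all $1 < k < n$; for $n = 2$ this hypothesis holds
vacuously, since the set of such $k$ is empty. Then, using $E(1) = P(1) = 1$ and $\binom{n}{1} = n$,

\begin{align*}
E(n) &= \frac{1}{2^n - 1} \left(n + \binom{n}{1} E(1) + \sum_{k=2}^{n-1} \binom{n}{k} E(k)\right) \\
     &= \frac{1}{2^n - 1} \left(2 \binom{n}{1} P(1) + 2 \sum_{k=2}^{n-1} \binom{n}{k} P(k)\right) \\
     &= \frac{2}{2^n - 1} \sum_{k=1}^{n-1} \binom{n}{k} P(k) \\
     &= 2P(n).
\end{align*}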

Since $P(n) < 1$ for $n > 1$, we immediately get $E(n) < 2$ for all $n$. For large $n$ this is an overestimate:
given the asymptotic behavior of $P(n)$, $E(n)$ approaches $\frac{1}{\ln 2} \pm 2 \times 10^{-5} \approx 1.4427\cdots$. But it is close enough
for our purposes. ∎
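As an illustrative check that is not part of the original analysis, the two recurrences above can be evaluated exactly with rational arithmetic. The sketch below assumes the forms $P(n) = \frac{1}{2^n - 1} \sum_{k=1}^{n-1} \binom{n}{k} P(k)$ and $E(n) = \frac{1}{2^n - 1} \left(n + \sum_{k=1}^{n-1} \binom{n}{k} E(k)\right)$ with $P(1) = E(1) = 1$; the function name `lemma23_tables` is hypothetical.

```python
from fractions import Fraction
from math import comb


def lemma23_tables(n_max):
    """Return lists P, E indexed by n = |A_0|, computed from the recurrences.

    P[n] is Pr[|A_tau| = 1] and E[n] is E[|A_tau|]. Index 0 is a placeholder:
    P(0) = 0 by convention, and E[0] is unused because A_0 must be nonempty.
    """
    P = [Fraction(0), Fraction(1)]
    E = [Fraction(0), Fraction(1)]
    for n in range(2, n_max + 1):
        denom = 2**n - 1
        P.append(sum(comb(n, k) * P[k] for k in range(1, n)) / denom)
        E.append((n + sum(comb(n, k) * E[k] for k in range(1, n))) / denom)
    return P, E


P, E = lemma23_tables(40)
for n in (1, 2, 3, 10, 40):
    # Print exact values as floats for readability.
    print(n, float(P[n]), float(E[n]))
```

For every $n$ between 2 and 40 the computed values satisfy $E(n) = 2P(n)$ exactly, which matches the induction in the proof; the case $n = 1$ is the only one where $E(n) = P(n)$.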

Theorem 24 Let $S$ be a skip graph with alphabet {0, 1}, and consider a search from $s$ to $t$ in $S$. Let $u$ be a
node with $s < u < t$ in the key ordering (the case $s > u > t$ is symmetric), and let $d$ be the distance from $u$
to $t$, defined as the number of nodes $v$ with $u < v \le t$. Then the probability that a search from $s$ to $t$ passes
through $u$ is less than $\frac{2}{d+1}$.

Proof: Let $S_{m(s)}$ be the skip list restriction of $S$ whose existence is shown by Lemma 1. From Lemma 2,
we know that searches in $S$ follow searches in $S_{m(s)}$. Observe that for $u$ to appear in the search path from
$s$ to $t$ in $S_{m(s)}$ there must be no node $v$ with $u < v \le t$ whose height in $S_{m(s)}$ is higher than $u$'s. It
follows that $u$ can appear in the search path only if it is among the tallest nodes in the interval $[u, t]$, i.e.,
if $|m(s) \wedge m(u)| \ge |m(s) \wedge m(v)|$ for all $v$ with $u < v \le t$. Recall that $m(s) \wedge m(v)$ is the common prefix
(possibly empty) of $m(s)$ and $m(v)$.

There are $d + 1$ nodes in this interval. By symmetry, if there are $k$ tallest nodes then the probability that
$u$ is among them is $\frac{k}{d+1}$. Let $T$ be the random variable representing the set of tallest nodes in the interval.
Then:

\[
\Pr[u \in T] = \sum_{k=1}^{d+1} \Pr[|T| = k] \cdot \frac{k}{d+1} = \frac{E[|T|]}{d+1}.
\]

What is the expected size of $T$? All $d + 1$ nodes have height at least 0, and in general each node with
height at least $k$ has height at least $k + 1$ with independent probability $\frac{1}{2}$. The set $T$ consists of the nodes that
are left at the last level before all nodes vanish. It is thus equal to $A_\tau$ in the process defined in Lemma 23,
and we have $E[|T|] < 2$ and thus $\Pr[u \in T] < \frac{2}{d+1}$. ∎
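The quantity bounded in this proof can also be estimated by simulation. The following is a minimal Monte Carlo sketch, not the experiment reported in the paper: it assumes, as in the proof, that the $d + 1$ nodes of the interval have independent heights, each at least $k$ with probability $2^{-k}$, and it treats node 0 as $u$. The names `geometric_height` and `estimate_hit_probability` are hypothetical. Note that it estimates $\Pr[u \in T]$, which is only an upper bound on the probability that a search actually passes through $u$.

```python
import random


def geometric_height(rng):
    """Count consecutive fair-coin successes, so height >= k has probability 2**-k."""
    h = 0
    while rng.random() < 0.5:
        h += 1
    return h


def estimate_hit_probability(d, trials, seed=0):
    """Estimate Pr[u in T] and E[|T|] for an interval of d + 1 nodes."""
    rng = random.Random(seed)
    hits = 0
    tallest_total = 0
    for _ in range(trials):
        heights = [geometric_height(rng) for _ in range(d + 1)]
        top = max(heights)
        tallest_total += heights.count(top)  # |T| for this trial
        if heights[0] == top:                # node 0 plays the role of u
            hits += 1
    return hits / trials, tallest_total / trials


for d in (1, 7, 63):
    p_hat, mean_T = estimate_hit_probability(d, trials=20000)
    # Compare the empirical frequency with the bound 2 / (d + 1).
    print(d, p_hat, mean_T, 2 / (d + 1))
```

With independent heights, the estimate of $\Pr[u \in T]$ should track the estimate of $E[|T|]$ divided by $d + 1$, up to sampling error, and both stay below the bound of the theorem.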

For comparison, experimental data for the congestion in a skip graph with 131072 nodes, together with 
the theoretical average predicted by Theorem 24, is shown in Figure 12. 
Figure 12: Actual and expected congestion in a skip graph with 131072 nodes with the target=76539.
Messages were delivered from each node to the target and the actual number of messages through each node 
was measured. The bound on the expected congestion is computed using Theorem 24. Note that this bound 
may overestimate the actual expected congestion. 



6.2 Distribution of the average congestion 

Theorem 24 is of small consolation to some node that draws a probability $\frac{2}{d+1}$ straw and participates in
every search. Fortunately, such disasters do not happen often. Define the average congestion $L_{tu}$ imposed
by a search for $t$ on a node $u$ as the probability that an $s$-$t$ search hits $u$ conditioned on the membership
vectors of all nodes in the interval $[u, t]$, where $s < u < t$, or, equivalently, $s > u > t$.³ Note that since
the conditioning does not include the membership vector of $s$, the definition in effect assumes that $m(s)$ is
chosen randomly. This approximates the situation in a fixed skip graph where a particular target $t$ is used
for many searches that may hit $u$, but the sources of these searches are chosen randomly from the other
nodes in the graph.

Theorem 24 implies that the expected value of $L_{tu}$ is no more than $\frac{2}{d+1}$. In the following theorem, we
show that the distribution of $L_{tu}$ declines exponentially beyond this point.

Theorem 25 Let $S$ be a skip graph with alphabet {0, 1}. Fix nodes $t$ and $u$, where $u < t$ and
$|\{v : u < v \le t\}| = d$. Then for any integer $\ell \ge 0$, $\Pr[L_{tu} > 2^{-\ell}] \le 2e^{-2^{-\ell} d}$.

Proof: Let $V = \{v : u < v \le t\}$ and let $m(V) = \{m(v) : v \in V\}$. As in the proof of Theorem 24, we
will use the fact that $u$ is on the path from $s$ to $t$ if and only if $u$'s height in $S_{m(s)}$ is not exceeded by the
height of any node $v$ in $V$.

³ It is immediate from the proof of Theorem 24 that $L_{tu}$ does not depend on the choice of $s$.

To simplify the notation, let us assume without loss of generality that $m(u)$ is the all-zero vector. Then
the height of $u$ in $S_{m(s)}$ is equal to the length of the initial prefix of zeroes in $m(s)$, and $u$ has height at least
$\ell$ with probability $2^{-\ell}$. Whether this is enough to raise it to the level of the tallest nodes in $V$ will depend
on what membership vectors appear in $m(V)$.

Let $m(s) = 0^i 1 \cdots$. Then $u$ has height exactly $i$, and is hit by an $s$-$t$ search unless there is some $v \in V$
with $m(v) = 0^i 1 \cdots$. We will argue that when $d = |V|$ is sufficiently large, then there is a high probability
that all initial prefixes $0^i 1$ appear in $m(V)$ for $i < \ell$. In this case, $u$ can only appear in the $s$-$t$ path if its
height is at least $\ell$, which occurs with probability only $2^{-\ell}$. So if $0^i 1$ appears as a prefix of some $m(v)$ for
all $i < \ell$, then $L_{tu} \le 2^{-\ell}$. Conversely, if $L_{tu} > 2^{-\ell}$, then for some $i < \ell$ the string $0^i 1$ does not appear
as a prefix of any $m(v)$.

Now let us calculate the probability that not all such prefixes appear in $m(V)$. We are going to show
that this probability is at most $2e^{-2^{-\ell} d}$, and so we need to consider only the case where $e^{-2^{-\ell} d} \le \frac{1}{2}$; this
bound is used in steps (2) and (3) below. We have:

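\begin{align*}
\Pr[L_{tu} > 2^{-\ell}] &\le \Pr[\neg(\forall i < \ell : \exists v \in V : 0^i 1 \preceq m(v))] \\
&= \Pr[\exists i < \ell : \forall v \in V : 0^i 1 \not\preceq m(v)] \\
&\le \sum_{i=0}^{\ell-1} \Pr[\forall v \in V : 0^i 1 \not\preceq m(v)] \\
&= \sum_{i=0}^{\ell-1} \left(1 - 2^{-i-1}\right)^d \\
&\le \sum_{i=0}^{\ell-1} e^{-2^{-i-1} d} \\
&= \sum_{j=0}^{\ell-1} \left(e^{-2^{-\ell} d}\right)^{2^j} \\
&\le \sum_{j=0}^{\ell-1} \left(e^{-2^{-\ell} d}\right)^{j+1} \tag{2} \\
&\le \sum_{j=0}^{\infty} \left(e^{-2^{-\ell} d}\right)^{j+1} \\
&= \frac{e^{-2^{-\ell} d}}{1 - e^{-2^{-\ell} d}} \\
&\le 2e^{-2^{-\ell} d}. \tag{3}
\end{align*}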
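The chain of inequalities can be checked numerically. The sketch below is an illustration under the assumption that the union-bound step is $\sum_{i=0}^{\ell-1} (1 - 2^{-i-1})^d$ and the final bound is $2e^{-2^{-\ell} d}$; the function names `union_bound` and `theorem25_bound` are hypothetical. Because the left-hand side is a probability, the comparison caps the union bound at 1, which covers the case $e^{-2^{-\ell} d} > \frac{1}{2}$ in which the final bound exceeds 1.

```python
import math


def union_bound(d, ell):
    """Sum over i < ell of (1 - 2**-(i+1))**d, the union-bound step of the proof."""
    return sum((1.0 - 2.0 ** -(i + 1)) ** d for i in range(ell))


def theorem25_bound(d, ell):
    """Closed-form bound 2 * exp(-(2**-ell) * d)."""
    return 2.0 * math.exp(-(2.0 ** -ell) * d)


for d in (16, 256, 1024):
    for ell in (1, 3, 5, 7):
        lhs = min(1.0, union_bound(d, ell))  # a probability never exceeds 1
        rhs = theorem25_bound(d, ell)
        print(d, ell, lhs, rhs, lhs <= rhs)
```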



7 Related Work 

SkipNet is a system very similar to skip graphs that was independently developed by Harvey et al. [HJS+03].
SkipNet builds a trie of circular, singly-linked skip lists to link the machines in the system. The machines'
names are sorted using the domain in which they are located (for example www.yale.edu). In addition to
the pointers between all the machines in all the domains that are structured like a skip graph, within each
individual domain, the machines are also linked using a DHT, and the resources are uniformly distributed
over all the machines using hashing. A search consists of two stages: First, the search locates the domain in
which a resource lies by using a search operation similar to that of a skip graph. Second, once the search reaches
some machine inside a particular domain, it uses greedy routing as in DHTs to locate the resource within
that domain. SkipNet has been successfully implemented, and this shows that a skip-graph-like structure
can be used to build real systems.

The SkipNet design ensures path locality, i.e., the traffic within a domain traverses other nodes only
within the same domain. Further, each domain gets to host its own data, which provides content locality
and inherent security. Finally, using the hybrid storage and search scheme provides constrained load
balancing within a given domain. However, as the name of the data item includes the domain in which it 
is located, transparent remapping of resources to other domains is not possible, thus giving a very limited 
form of load balancing. Another drawback of this design is that it does not give full-fledged spatial locality. 
For example, if the resources are document files, sorting according to the domain on which they are served 
gives no advantage in searching for related files compared to DHTs. 

Zhang et al. [ZSZ03] and Awerbuch et al. [AS03] have both independently suggested designs for peer-to-peer
systems using separate data structures for resources and the machines that store them. The main idea
is to build a data structure $D$ over the resources, which are distributed uniformly among all the machines
using hashing, and to build a separate DHT over all the machines in the system. Each resource maintains
the keys of its neighboring resources in $D$, and each machine maintains the addresses of its neighboring
machines as per the DHT network. To access a neighbor $b$ of resource $a$, $a$ initiates a DHT search for the
hash value of $b$. One pointer access in $D$ is converted to a search operation in the DHT, so if any operation in
$D$ takes time $t$, the same operation takes $O(t \log m)$ time with $m$ machines in this hybrid system. Zhang et
al. [ZSZ03] focus on implementing a tree of the resources, in which each node in the tree is responsible for
some fixed range of the keyspace that its parent is responsible for. Awerbuch et al. [AS03] propose building
a skip graph of the resources on top of the machines in the DHT.
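To make the cost conversion concrete, here is a short worked bound under stated assumptions; it is not a result taken from [ZSZ03] or [AS03]. Suppose each DHT search costs at most $c \log m$ messages, where $c$ is a hypothetical constant introduced only for this illustration, and suppose an operation in $D$ performs $t$ pointer accesses. Then the hybrid cost $T_{\mathrm{hybrid}}$ satisfies

\[
T_{\mathrm{hybrid}} \le t \cdot c \log m = O(t \log m),
\qquad \text{so that } t = O(\log n) \text{ gives } T_{\mathrm{hybrid}} = O(\log n \cdot \log m).
\]

The case $t = O(\log n)$ corresponds to a skip graph search over $n$ resources. Under the additional assumption that $n$ is polynomial in $m$, this becomes $O(\log^2 m)$.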

This design approach is interesting because it allows building any data structure using the resources, 
while providing uniform load balancing. In particular, both these systems support complex queries as in 
skip graphs, and uniform load balancing as in DHTs. We believe that distributing the resources uniformly 
among all the nodes (as described in Section 2.1) will also have the same properties as these two approaches. 

However, the Awerbuch et al. approach and our uniform resource distribution approach suffer from the
same problems of high storage requirements and high volume of repair mechanism message traffic as a skip
graph. With $m$ machines and $n$ resources in the system, in the Awerbuch et al. approach, each machine has
to store $O(\log m)$ pointers (for the DHT links) and $O(\log n)$ keys for each resource that it hosts (for the skip
graph pointers). Further, the repair mechanism has to repair the broken DHT links as well as inconsistent
skip graph keys. Finally, the search performance is degraded to $O(\log^2 m)$ compared to $O(\log m)$ in DHTs
and $O(\log n)$ in skip graphs. In comparison, in our approach of uniform resource distribution, each machine
has to store $O(\log n)$ pointers for each resource that it hosts, repair is required only for the skip graph links,
and the search time is $O(\log n)$ as in skip graphs.

In the Zhang et al. approach, each machine has to store $k$ keys for the $k$ children of each tree node that
it hosts, and $O(\log m)$ pointers for the DHT links. Repair involves fixing broken tree keys as well as broken
DHT links. This scheme suffers from the other problems of tree data structures such as increased traffic on
the nodes higher up in the tree, and vulnerability to failures of these nodes. Further, unlike skip graphs, it
requires a priori knowledge about the keyspace in order to assign specific ranges to the tree nodes. Thus, it
is an open problem to design a system that efficiently supports both uniform load balancing and complex 
queries. 

8 Conclusion 

We have defined a new data structure, the skip graph, for distributed data stores that has several desirable 
properties. Constructing, inserting new nodes into, and searching in a skip graph can be done using simple 
and straightforward algorithms. Skip graphs are highly resilient, tolerating a large fraction of failed nodes 
without losing connectivity. Using the repair mechanism, disruptions to the data structure can be repaired in
the absence of additional faults. Skip graphs also support range queries, which allow, for example, searching
for a copy of a resource near a particular location by using the location as a low-order field in the key, and
clustering of nodes with similar keys.

As explained in Section 7, one issue that remains to be addressed is the large number of pointers per 
machine in the system. It would be interesting to design a peer-to-peer system that maintains fewer pointers 
per machine and yet supports spatial locality. Also, skip graphs do not exploit network locality (such as
latency along transmission paths) in locating resources, and it would be interesting to study performance
benefits in that direction, perhaps by using multi-dimensional skip graphs. As with other overlay networks, 
it would be interesting to see how the network performs in the presence of Byzantine failures. Finally, it 
would be useful to develop a more efficient repair mechanism and a self-stabilization mechanism to repair 
defective skip graphs. 